1 Introduction

1.1 Background and motivation

Human resources are the most valuable asset in any country (Peretti, 2004). They are the main reason behind the success or the failure of any organization. In fact, having an educated and competent manpower is the key driver to economic and social development. In this context, the importance of academic education has become undeniable. Therefore, it is crucial to invest money and time in order to study students’ academic performance and figure out effective ways to improve it.

Given the importance of the topic, it has been given particular attention in past research. In fact, many studies have been conducted in order to analyze the factors impacting students’ academic performance. While some studies focused on the psychological variables, such as Franck Amadieu & André Tricot’s research (Amadieu and Tricot, 2022), other researchers have been interested in the impact of other elements such as mobility (Richard, 2009), gender and other socio-economic factors on students’ academic success.

Many reasons motivated us to choose this topic of research. In fact, as students, we are very passionate about the educational field. Thus, we want to provide through this project a detailed analysis that can be used as a reference guide for leaders working in the educational field. Mainly, we want to help schools and universities to have a better understanding of the factors influencing students’ academic performance in order to improve their decision-making processes, students’ success rate and eventually their overall organization.

1.2 Project objectives

The aim of the project is to understand the evolution of secondary academic performance in France. Our study will mainly focus on 3rd grade students (equivalent to 11th grade in Switzerland) and their results on the Diplôme National du Brevet (DNB) by school.

First, we will observe whether there are improvements or, on the contrary, deterioration in admissions of DNB over the years. From this dataset, we will also make comparisons, particularly at the geographical level, and an analysis of the success rate in terms of distinction for each school.

Then, we will try to understand if there is a correlation between academic success and some socio-economic factors, such as the type of accommodation, the single-parent families rate, and the involvement of schools in students’ physical and sports practice. Finally, despite these factors, we will investigate whether the COVID-19 pandemic has had a direct negative impact on students’ school performance.

1.3 Research questions

  • What is the evolution of student performance over time and across the different regions/departments of France?

  • Do socio-economic factors such as the type of accommodation, family situation or college policies have an influence on student success ?

  • Has the COVID-19 pandemic impacted student performance?

2 Data Sets

2.1 dnb_results

This dataset presents the results of the “diplôme national du brevet” by school, for schools in metropolitan France and for the overseas departments and regions. This data set contains 139’580 observations.

Variable Meaning
session Year of the exam session
school_id School identification number
school_type School type divided in six categories: COLLEGE, LYCEE PROFESSIONNEL, LYCEE, EREA, CFA, and AUTRE
establishment_name Name of the establishment
education_sector Education sector categorised as public or private
municipality_code Municipality code
municipality Name of the municipality
department_code Department code. It is to be noted that France has 101 departements.
department Name of the department
academy_code Academy code
academy_name Name of the academy
region_code Region code. It is to be noted that France has 18 administrative, regions
region Name of the region
registered Registered candidates
present Candidates present for the exam
admitted Candidates admitted
admitted_without Candidates admitted without distinction
admitted_AB Candidates admitted with distinction “Assez Bien”
admitted_B Candidates admitted with distinction “Bien”
admitted_TB Candidates admitted with distinction “Très bien”
success_rate “Success rate [Present]/[Admis] as a percentage”

Source of the data set

2.2 Generation 2024 data set

This data set gathers all schools which have been awarded the “Generation 2024” label. The objective of this label, developed in view of the Paris 2024 Olympic Games, is to develop bridges between the school world and the sports movement in order to encourage young people to take part in physical activity and sport. This data set contains 6’883 observations.

Variable Meaning
region Name of the region
academy Name of the academy
department Name of the department
municipality Name of the municipality
establishment Name of the establishment
school_id School identification number
school_type School type
education_sector Education sector categorised as public or private
postcode Postcode
adress Address of the establishment
adress_2 Additional address of the establishment
mail E-mail address of the establishment
students Number of students in the school
priority_education Indicates whether the school is located in a priority education network (REP) or a reinforced priority education network (REP+)
city_school Indicates whether the school is part of a city school
QPV Position relative to a priority neighbourhood of the city policy. It is a policy aimed at compensating for differences in living standards with the rest of the territory.
ULIS “Indicates whether the school offers a ULIS (Localized Unit for School Inclusion)”
SEGPA “Indicates whether the school has a SEPGA (adapted general and vocational education sections)”
sport_section Indicates whether the school has a sports section
agricultural_high_school Indicates whether the school is an agricultural high school
military_high_school Indicates whether the school is a military high school
vocational_high_school Indicates whether the establishment is labeled “vocational high school”
establishment_web Url of the description of the establishment page on the ONISEP website
SIREN_SIRET “SIREN/SIRET number of the establishment. SIREN is for Business Register Identification System in french.”
district Name of the district to which the school is attached
ministry Ministry responsible for the institution
label_start_date Start date of the “generation 2024” label. Format yyyy/mm/dd
label_end_date End date of the “generation 2024” label. Format yyyy/mm/dd
y_coordinate Y coordinate of the establishment, using the EPSG coordinate system
x_coordinate X coordinate of the establishment, using the EPSG coordinate system
epsg EPSG code of the coordinate system used to locate the establishment
precision_on_localisation Specification of the geographical location of the establishment
latitude Latitude
longitude Longitude
position Geographical position
engaging_30_sport Indicates whether the institution participates in the 30 minutes of daily physical activity programme

Source of the data set

2.3 Student housing Data set

This dataset records enrolment in secondary schools according to the type of accommodation for pupils: half-board, boarding school etc. This data set contains 32’096 observations.

Variable Meaning
year_back_to_school Year of the start of the school year
region Name of the academic region
academy Name of the academy
department Name of the department
municipality Name of the municipality
number School identification number
establishment_main_name Main name of the establishment
establishment_name Name of the establishment
school_type School type
education_sector Education sector categorised as public or private
students_secondary_education Students in secondary education
students_higher_education Number of students in higher education
external_students_secondary_education External students in secondary education
half_boarders_students_secondary_education Half-boarders in secondary education
boarding_students_secondary_education Boarding students in secondary education
external_students_higher_education External students in higher education
half_board_students_higher_education Half-board students in higher education
boarding_students_higher_education Boarding students in higher education

Source of the data set

2.4 Single-parent families dataset

This data set provides information about the single-parent families in each municipality. The census is made two years in a row every five years since 2007. This data set contains 606 observations.

Variable Meaning
geocode Departmental code from INSEE
libgeo Name of department
year Census year
sing_par Number of single-parent families

Source of the data set

2.5 Covid Data set

This is a time based data set that gives us information on the COVID 19 tests and results carried out by laboratories, hospitals, pharmacists, doctors and nurses. It is updated daily. On the 30th November, the data set contained 82’394 observations.

Variable Meaning
department_code Department code
test_week Week of the tests. Format yyyy-Sww
educational_level Description of the age group as [m-n], m and n being the lower and upper limits.
age_group Denomination of the age group. n-1 is used in this case excepet for the oldest group where 18 is used
pop Population
positive Weekly patients testing positive
tested Weekly patients tested
incidence_rate Incidence rate
positivity_rate Positivity rate
screening_rate Screening rate

Source of the data set

Loading of the data All but the single_parent dataset are CSV files with semicolons as separators. The single_parent data set is in excel format, so we have to use read_excel. We used “skip = 4” because the document includes extra header information rows.

DNB_par_etablissement <- read_delim(here::here("data/DNB-par-etablissement.csv"), ";", escape_double = FALSE, trim_ws = TRUE)
Etablissements_labellises_generation_2024 <- read_delim(here::here("data/Etablissements-labellises-generation-2024.csv"),";", escape_double = FALSE, trim_ws = TRUE)
Hebergement_eleves_etablissements_2d <- read_delim(here::here("data/Hebergement-eleves-etablissements-2d.csv"), ";", escape_double = FALSE, trim_ws = TRUE)
insee_rp_hist_xxxx <- read_excel(here::here("data/insee_rp_hist_xxxx.xlsx"), skip = 4)
covid_sp_dep_heb_cage_scol_2022_11_30_19h01 <- read_delim(here::here("data/covid_sp_dep_heb_cage_scol_2022_11_30_19h01.csv"), ";", escape_double = FALSE, trim_ws = TRUE)

3 Data Wrangling

We have realised that some wrangling are necessary for each data sets. We have established a checklist that we will go through for each data set. We have to :

  1. Translate the column names. As we have to rename all data sets in the same way, we have created a function. The function has a data frame and a vector as inputs. It checks if the length of the vector is correct, if so it returns a tibble with the column names renamed. Otherwise, the function gives an error message stating that the vector is not the right length.

rename_df <- function(df, x){
  if (ncol(df) == length(x)){
    names(df) <- c(x)
    df <- as_tibble(df)
  } else {
    stop("Vector is not the right length")
  }
}
  1. Make sure that all data are of the right type.
  2. Make sure that the time reference (year) are all aligned with the exam session.
  3. Add a column department_fr which will be harmonized between all data set in order to join them easily. We have also decided to keep only establishments in mainland France and Corsica. We wanted to focus on this part of France in order to get more comparable results. We note that including the results from all the overseas region would be interesting for further researches.
  4. and more. We need to make sure that the data set does not need any further specific wrangling.

3.1 dnb_results

  1. Translate the column names.

dnb_colnames <- c("session", "school_id", "school_type", "establishment_name", "education_sector", "municipality_code", "municipality", "department_code", "department", "academy_code", "academy_name", "region_code", "region", "registered", "present", "admitted", "admitted_without", "admitted_AB", "admitted_B", "admitted_TB", "success_rate_pct"
)
dnb_results <- rename_df(DNB_par_etablissement, dnb_colnames)
  1. success_rate is of the form xx,xx% we want it as a double of the form xx.xx

dnb_results[["success_rate_pct"]] <- as.double(gsub("%","",
                                               gsub(",",".", dnb_results[["success_rate_pct"]])))
  1. We need to harmonize the year variables of all the other data sets to match the logic of this one. The year is the year of the exam session (e.g academic period “2020-2021” is represented as 2021)
  2. We need to add the column department_fr and drop the overseas collectivities (COM).

dnb_results$department_fr <- stri_trans_general(dnb_results$department, "Latin-ASCII") %>%
  str_to_title(.) %>% 
  gsub("Du", "du", .) %>% 
  gsub("De", "de", .) %>% 
  gsub("D'", "D", .) %>%
  gsub("Et", "et", .) %>%
  gsub(" ", "-", .) %>%
  str_replace_all("Corse-du-Sud", "Corse du Sud") %>% 
  str_replace_all("deux-Sevres", "Deux-Sevres") %>% 
  str_replace_all("Alpes-de-Hte-Provence", "Alpes-de-Haute-Provence") %>%       
  str_replace_all("Territoire-de-Belfort", "Territoire de Belfort") %>% 
  str_replace_all("Seine-Saint-denis", "Seine-Saint-Denis")

dnb_results <- dnb_results %>% 
dplyr::filter(!department_fr %in% c("Polynesie-Française","Guyane", "Martinique", "Guadeloupe", "La-Reunion", "Mayotte", "NA", "-"))
  1. We want to know the attribution rate of each distinction. It is needed in order to do some true comparisons between region without the same number of students.
dnb_results <- dnb_results %>% 
  mutate(without_pct = admitted_without/admitted*100,
         AB_pct = admitted_AB/admitted*100,
         B_pct = admitted_B/admitted*100,
         TB_pct = admitted_TB/admitted*100
         )
  1. Other data sets used for the project only go down to the department level. We therefore have to summarize all values to the department level for further merges and analyses between data sets.
dnb_results_dep <- dnb_results %>% 
  select(session, department_code, region:TB_pct) %>% 
  group_by(department_fr, session, region) %>% 
  summarise(registered = sum(registered),
            present = sum(present),
            admitted = sum(admitted),
            admitted_without = sum(admitted_without),
            admitted_AB = sum(admitted_AB),
            admitted_B = sum(admitted_B),
            admitted_TB = sum(admitted_TB),
            without_pct = mean(without_pct, na.rm = TRUE),
            AB_pct = mean(AB_pct, na.rm = TRUE),
            B_pct = mean(B_pct, na.rm = TRUE),
            TB_pct = mean(TB_pct, na.rm = TRUE),
            success_rate_pct = mean(success_rate_pct, na.rm = TRUE))

We can see the summarized data set dnb_results_dep below.

3.2 establishment_24

  1. Translate the column names.
est_24_names <- c("region", "academy", "department", "municipality", "establishment", "school_id", "school_type", "education_sector", "postcode", "adress", "adress_2", "mail", "students", "priority_education", "city_school", "QPV", "ULIS", "SEGPA", "sport_section", "agricultural_high_school", "military_high_school", "vocational_high_school", "establishment_web", "SIREN_SIRET", "district", "ministry", "label_start_date", "label_end_date", "y_coordinate", "x_coordinate", "epsg", "precision_on_localisation", "latitude", "longitude", "position", "engaging_30_sport")
establishment_24 <- rename_df(Etablissements_labellises_generation_2024, est_24_names)
  1. No problem for this data set
  2. We need to add two variables session_started and session_ended. Indeed as the label has a start and an End date we have to trace the first session and the last session where the establishment have the label generation 2024.Most labellisations start and end in January but a few start and end in middle of the year. Exams take place end of June, beginning of July. Therefore, we will consider labellization done in August and after as done for the next academic year.

establishment_24 <- establishment_24 %>% 
  mutate(session_started = case_when(month(label_start_date) <= 7 ~ year(label_start_date),
                                     month(label_start_date) >  7 ~ year(label_start_date)+1),
         session_ended = case_when(month(label_end_date) <= 7 ~ year(label_end_date),
                                   month(label_end_date) >  7 ~ year(label_end_date)+1)
         )
  1. We need to add the column department_fr.

establishment_24$department_fr <- stri_trans_general(establishment_24$department, "Latin-ASCII") %>%
  str_to_title(.) %>% 
  gsub("Du", "du", .) %>% 
  gsub("De", "de", .) %>% 
  gsub("D'", "D", .) %>%
  gsub("Et", "et", .) %>%
  gsub(" ", "-", .) %>%
  str_replace_all("Corse-du-Sud", "Corse du Sud") %>% 
  str_replace_all("deux-Sevres", "Deux-Sevres") %>% 
  str_replace_all("Territoire-de-Belfort", "Territoire de Belfort") %>% 
  str_replace_all("Seine-Saint-denis", "Seine-Saint-Denis")

We can see on the map below, that the data set contains establishments from the overseas collectivities (COM) but from the French international schools as well.

As previsouly discussed we have decided to keep only data from mainland France. We had to make sure that we also removed the French international schools. We took the opportunity to remove unused variables.

establishment_24 <- establishment_24 %>% 
  dplyr::filter(!department_fr %in% c("Polynesie-Francaise","Guyane", "Martinique", "Guadeloupe", "La-Reunion", "Mayotte", "Saint-Martin", "-")) %>% 
  dplyr::filter(!department_fr == "NA")#"NA" and "-" makes sure that we have no more International schools. 

#establishment_24 has a lot of variables which we will for sure not use
establishment_24 <- establishment_24 %>% 
  select(-c(postcode:mail,city_school,QPV:SEGPA,establishment_web:ministry, precision_on_localisation))
  1. The “Diplome National du Brevet” is the diploma received at the end of “collège”. We have to keep establishments which are “collège” and drop all the rest.

  2. Other data sets only go down to the department level. We therefore have to create a simplified version for further merges between data sets. We will summarize by counting the number of establishment created each session by department. The end of the label period is not of interest for us as the first end date is on the 9.01.2021 which is categorised as session 2022 and we have results for dnb up until session 2021.

establishment_24_dep <- establishment_24 %>% 
  select(region, department_fr, session_started) %>% 
  group_by(department_fr, session_started) %>% 
  summarise(establishment = n())

We can see the summarized data set establishment_24_dep below.

3.3 student_housing

  1. Translate the column names.

housing_names <- c("year_back_to_school", "region", "academy", "department", "municipality", "school_id", "establishment_main_name", "establishment_name", "school_type", "education_sector", "students_secondary_education", "students_higher_education", "external_students_secondary_education", "half_boarders_students_secondary_education", "boarding_students_secondary_education", "external_students_higher_education", "half_board_students_higher_education", "boarding_students_higher_education")
student_housing <- rename_df(Hebergement_eleves_etablissements_2d, housing_names)
  1. No Problem for this data set
  2. We need to create a session variable as year_back_to_school refers to the beginning of the school year and not the exam session.

student_housing <- student_housing %>% 
  mutate(session = year_back_to_school + 1) %>% 
    select(year_back_to_school,session, everything()) #here just to order variables
  1. We need to add the column department_fr and remove the departments outside mainland France.
student_housing$department_fr <- stri_trans_general(student_housing$department, "Latin-ASCII") %>%
  str_to_title(.) %>% 
  gsub("Du", "du", .) %>% 
  gsub("De", "de", .) %>% 
  gsub("D'", "D", .) %>%
  gsub("Et", "et", .) %>%
  gsub(" ", "-", .) %>%
  str_replace_all("Corse-du-Sud", "Corse du Sud") %>% 
  str_replace_all("deux-Sevres", "Deux-Sevres") %>% 
  str_replace_all("Alpes-de-Hte-Provence", "Alpes-de-Haute-Provence") %>%       
  str_replace_all("Territoire-de-Belfort", "Territoire de Belfort") %>% 
  str_replace_all("Seine-Saint-denis", "Seine-Saint-Denis")

student_housing <- student_housing %>% 
  dplyr::filter(!department_fr %in% c("Polynesie-Francaise","Guyane", "Martinique", "Guadeloupe", "La-Reunion", "Mayotte", "Saint-Martin", "-")) %>% 
  dplyr::filter(!department_fr == "NA")
  1. We are not interested by students from higher education and therefore can delete their variables.
student_housing <- student_housing %>% 
  select(-c(contains("higher")))
  1. We want to know the rate of each housing offering. It is needed in order to do some true comparisons between region without the same number of students.
student_housing <- student_housing %>% 
  mutate(external_students_rate = external_students_secondary_education/students_secondary_education*100, 
         half_boarders_students_rate = half_boarders_students_secondary_education/students_secondary_education*100,
         boarding_students_rate = boarding_students_secondary_education/students_secondary_education*100)
  1. Other data sets only go down to the department level. We therefore have to create a simplified version for further merges between data sets. We will summarize all values to the department level.
housing_dep <- student_housing %>% 
  select(session, region, students_secondary_education:boarding_students_rate) %>% 
  group_by(session, department_fr) %>% 
  summarise(external_students_secondary_education = sum(external_students_secondary_education, na.rm = TRUE),
            half_boarders_students_secondary_education = sum(half_boarders_students_secondary_education, na.rm = TRUE), 
            boarding_students_secondary_education = sum(boarding_students_secondary_education, na.rm = TRUE),
            students_secondary_education = sum(students_secondary_education, na.rm = TRUE),
            external_students_rate = mean(external_students_rate, na.rm = TRUE),
            half_boarders_students_rate = mean(half_boarders_students_rate, na.rm = TRUE), 
            boarding_students_rate = mean(boarding_students_rate, na.rm = TRUE))

We can see the summarized data set housing_dep below.

3.4 single_parent

  1. Translate the column names.
sg_parent_names <- c("geocode", "department", "session","sing_par")
single_parent <- rename_df(insee_rp_hist_xxxx, sg_parent_names)
  1. We have to change the scalar of the “session” and “single_par” variables into doubles.
single_parent[["session"]]<- as.double(single_parent[["session"]])
single_parent[["sing_par"]]<- as.double(single_parent[["sing_par"]])
  1. The session variable is already implemented in this data set.
  2. We need to add the column department_fr and remove the department outside of mainland France.
single_parent$department_fr <- stri_trans_general(single_parent$department, "Latin-ASCII") %>%
  str_to_title(.) %>%
  gsub("Du", "du", .) %>%
  gsub("De", "de", .) %>%
  gsub("D'", "D", .) %>%
  gsub("Et", "et", .) %>%
  gsub(" ", "-", .) %>%
  str_replace_all("Corse-du-Sud", "Corse du Sud") %>%
  str_replace_all("deux-Sevres", "Deux-Sevres") %>%
  str_replace_all("Alpes-de-Hte-Provence", "Alpes-de-Haute-Provence") %>%      
  str_replace_all("Territoire-de-Belfort", "Territoire de Belfort") %>%
  str_replace_all("Seine-Saint-denis", "Seine-Saint-Denis")

single_parent <- single_parent %>% 
dplyr::filter(!department_fr %in% c("Polynesie-Francaise","Guyane", "Martinique", "Guadeloupe", "La-Reunion", "Mayotte", "Saint-Martin", "-")) %>% 
  dplyr::filter(!department_fr == "NA")
  1. No need for further data wrangling for this data set

We can see the final table single_parent below.

3.5 covid_in_schools

  1. Translate the column names.
covide_names <- c("department_code", "test_week", "educational_level", "age_group", "pop", "positive", "tested", "incidence_rate", "positivity_rate", "screening_rate")
covid_in_schools <- rename_df(covid_sp_dep_heb_cage_scol_2022_11_30_19h01,covide_names)
  1. test_week will be treated in (3.). positive, incidence_rate and positivity_rate need to be doubles
covid_in_schools[["positive"]] <- as.double(gsub(",",".", covid_in_schools[["positive"]]))
covid_in_schools[["incidence_rate"]] <- as.double(gsub(",",".", covid_in_schools[["incidence_rate"]]))
covid_in_schools[["positivity_rate"]] <- as.double(gsub(",",".", covid_in_schools[["positivity_rate"]]))
  1. We need to create two new variables. The first will be the test_date categorizing each week. We chose the first day of the week. As we had the week number we had to select the week number and then for each year add seven days to the Monday of the first week of the year. The second variable is the session. A session is categorised from August to July of the next year. As our argument will be set on the month, we might have some test done the first days of august count towards the “wrong” session. The number of Covid cases in August are relatively low compared to the rest of the year and it represents at maximum 6 days of tests. Therefore we consider this margin of error to be satisfactory.
covid_in_schools <- covid_in_schools %>%
  mutate(test_date = case_when (as.numeric(substr(test_week, 1,4))== 2020
                                ~ lubridate::ymd('2019-12-30') + lubridate::weeks(as.numeric(substr(test_week, 7,8))),
                                as.numeric(substr(test_week, 1,4))== 2021
                                ~ lubridate::ymd('2021-01-04') + lubridate::weeks(as.numeric(substr(test_week, 7,8))),
                                as.numeric(substr(test_week, 1,4))== 2022
                                ~ lubridate::ymd('2022-01-03') + lubridate::weeks(as.numeric(substr(test_week, 7,8)))),
         session = case_when(month(test_date) <= 7 ~ year(test_date),
                             month(test_date) >  7 ~ year(test_date)+1))
                              
  1. Only department code. We need to input the column department_fr and region. We use the department_fr and region variables from dnb_results. We join the two datasets by department_code. To do this, we first need to match the two by removing the first character of the department_code from dnb_results.
reg_department <- dnb_results %>%
  select(c("department_code", "department_fr", "region")) %>%
  unique()

reg_department$department_code <- substring(reg_department$department_code, 2)
  
covid_in_schools <- right_join(x = covid_in_schools, y = reg_department, by = "department_code")
  1. We are only interested in the age group of 11 to 15. We use filter to keep only the students passing the DNB.
covid_in_schools <- covid_in_schools %>%
  filter(educational_level == "[11-15)")

We can see the final table covid_in_schools below.

3.6 Auxiliary data sets

We will use the ggplot France map for our visualizations

map <- map_data("france")

The region variable is in fact the departments. We rename it “department_fr” to fit in with the other data sets.

colnames(map)[5]<- "department_fr"

4 Exploratory data analysis

4.1 dnb_results

To explore this data set we have decided to start on a national level to analyse the global tendency. We will then go down a level to a regional analysis to compare the number of students and see which region performs better. An analysis at the regional level will then be performed to dig deeper into the success rate and the graduation rate for each mention. To be complete with our analysis, we will see the results by establishment for the best and worst performing establishments in 2020. We will use their results of 2006 in comparison.

4.1.1 National analysis

For this first visualisation we wanted to see the evolution of the DNB results over time in order to have an overview of the data we were working with. For this, we created a data set summarizing the results at the national level. We simplified our visualisation process by pivoting the table. During the process, we created four new variables Candidates, Number_of_students, Mention_type and Rate.

France_results <- dnb_results %>% 
  group_by(session) %>% 
  summarise(registered = sum(registered),
            present = sum(present),
            admitted = sum(admitted),
            admitted_without = sum(admitted_without),
            admitted_AB = sum(admitted_AB),
            admitted_B = sum(admitted_B),
            admitted_TB = sum(admitted_TB),
            without_pct = mean(without_pct, na.rm = TRUE),
            AB_pct = mean(AB_pct, na.rm = TRUE),
            B_pct = mean(B_pct, na.rm = TRUE),
            TB_pct = mean(TB_pct, na.rm = TRUE),
            success_rate_pct = mean(success_rate_pct, na.rm = TRUE)) %>% 
  pivot_longer(c(registered, present,contains("admitted")),
               names_to = "Candidates",
               values_to = "Number_of_students") %>% 
   pivot_longer(c(contains("pct")), 
               names_to = "Mention_type",
               values_to = "Rate")

For this first graph, we plotted the Number_of_students and grouped them by Candidates.


p <- France_results %>% 
  ggplot(aes(x = session, y = Number_of_students, group = Candidates, color = Candidates))+
  geom_line()+
  scale_color_viridis(discrete = TRUE) +
    ggtitle("National DNB statistics") +
    theme_ipsum() +
    ylab("Number of students")

ggplotly(p, tooltip = c("x" ,"y"))

According to the graph, we see that the number of admissions is increasing over the years. In fact, there is 704’742 admitted students in 2021 compared to only 572’236 in 2006. Though, we notice a slight decrease in 2018 and 2019. The positive trend resumed after that until 2021 and reached its peak in 2020 with 716’237 admissions. Similarly, the registration number follows the same tendency. In fact, 729’803 students had registered in 2006 compared to 801’721 students in 2021, for an increase of 9.85%, which is the highest record that has been stated so far. Also, the presence number broke the record in 2021 as it reached 795’209 students.

For the the analysis of the success rate and the rate of achievement of honours. We plotted the Rate and grouped them by Mention_type.


p <- France_results %>% 
  ggplot(aes(x = session, y = Rate, group = Mention_type, color = Mention_type))+
  geom_line()+
  scale_color_viridis(discrete = TRUE) +
    ggtitle("National DNB statistics") +
    theme_ipsum() +
    ylab("Rate in %")

ggplotly(p, tooltip = c("x" ,"y"))

The number of students that were admitted without a mention has been decreasing since 2006. It reached 284’774 students (53.85%) in 2006, compared to 154’095 students (25.11%) in 2021. The negative tendency calmed itself during the 2017-2021 period. As it is the case for all mentions, there is a complete shift in 2017. This will be analysed in further detail in the part one of our analysis. Contrary to the students graduating without honours, the number of students admitted with the TB distinction is showing a positive pattern. It grew from 31’563 (4.49%) in 2006 to reach 204’523 (23.98%) in 2021. The pattern for the two other distinctions are relatively similar as they remained stable up until 2017 but the change was not as drastic as for the first two distinction mentioned. The number of students that are admitted with a good mention show a positive pattern. In fact, the number has developed from 91’719 students (13.66%) in 2006 to 180’774 students (25.07%) in 2021. The number of students admitted with an AB distinction is increased slightly from 2006 to 2021. It developed from 164’180 (28%) students in 2006 to 165’350 students (25.83%) in 2021, representing a growth rate of 0.71%.

4.1.2 Regional Analysis

We started with a comparative analysis in order to compare the success rate between the regions. For this, we needed to create the charts below based on the dnb-results data set.

We created this bar plot in order to visualize the average success rate of each of the French regions. We used the function group_by in order to group our success rate data by region: We took the success_rate and the region, we computed the average success rate for each region and, we plotted them in the graph below.


p <- dnb_results %>% 
  select(success_rate_pct, region, department) %>% 
  group_by(region) %>% 
  summarise(success_rate = mean(success_rate_pct, na.rm = TRUE)) %>% 
  ggplot(aes(x = region, 
             y = success_rate, 
             fill = region)) + 
         geom_col() +
         scale_fill_viridis(discrete = TRUE) +
         labs(title = "Average success rate by region, 2006-2021", y = "Rate in %")+
         theme_ipsum() +
         theme(axis.title.x=element_blank(),
               axis.text.x=element_blank(),
               axis.ticks.x=element_blank())

ggplotly(p, tooltip = c("x" ,"y"))

On average, French regions have performed similarly over the years. However, the region of Bretagne took the lead, achieving a success rate of 89.8%. The region Nouvelle-Aquitaine followed with a success rate of 86.6%. On the third position, Auvergne-Rhône-Alpes recorded a success rate of 86.4%.

Number of students per result and region during the period 2006-2021

To have a clearer idea about the regions that performed the best, we created the following bar plot. It is based on summarized data to help us visualize the regions that recorded the highest number of admissions so far. Students’ distinction has once again been taken into consideration. Hence, we could have an idea about the regions that recorded the highest number of admissions for each of the distinctions until 2021.

Number of admitted by region

p <- dnb_results %>% 
 select(admitted, region) %>% 
 group_by(region) %>% 
 summarise(admitted = sum(admitted)) %>% 
 ggplot(aes(x = region,
            y = admitted, 
            fill = region)) + 
        geom_col() +
        scale_fill_viridis(discrete = TRUE) +
        theme_ipsum() +
        ggtitle("Number of admitted by region") +
        ylab("Number of students")+
        theme(axis.title.x=element_blank(),
              axis.text.x=element_blank(),
              axis.ticks.x=element_blank())

ggplotly(p, tooltip = c("x" ,"y"))
Admitted with zero mention

p <- dnb_results %>% 
 select(admitted_without, region) %>% 
 group_by(region) %>% 
 summarise(zero = sum(admitted_without)) %>% 
 ggplot(aes(x = region, 
            y = zero, 
            fill = region)) + 
        geom_col() +
        scale_fill_viridis(discrete = TRUE) +
        theme_ipsum() +
        ggtitle("Admitted with zero mention") +
        ylab("Number of students")+
        theme(axis.title.x=element_blank(),
              axis.text.x=element_blank(),
              axis.ticks.x=element_blank())

ggplotly(p, tooltip = c("x" ,"y"))
Admitted with mention AB

p <- dnb_results %>% 
 select(admitted_AB, region) %>% 
 group_by(region) %>% 
 summarise(AB = sum(admitted_AB)) %>% 
 ggplot(aes(x = region, 
            y = AB, 
            fill = region)) + 
        geom_col() +
        scale_fill_viridis(discrete = TRUE) +
        theme_ipsum() +
        ggtitle("Admitted with mention AB") +
        ylab("Number of students")+
        theme(axis.title.x=element_blank(),
              axis.text.x=element_blank(),
              axis.ticks.x=element_blank())

ggplotly(p, tooltip = c("x" ,"y"))
Admitted with mention B

p <- dnb_results %>% 
 select(admitted_B, region) %>% 
 group_by(region) %>% 
 summarise(B = sum(admitted_B)) %>% 
 ggplot(aes(x = region, 
            y = B, 
            fill = region)) + 
        geom_col() +
        scale_fill_viridis(discrete = TRUE) +
        theme_ipsum() +
        ggtitle("Admitted with mention B") +
        ylab("Number of students")+
        theme(axis.title.x=element_blank(),
              axis.text.x=element_blank(),
              axis.ticks.x=element_blank())

ggplotly(p, tooltip = c("x" ,"y"))
Admitted with mention TB

p <- dnb_results %>% 
 select(admitted_TB, region) %>% 
 group_by(region) %>% 
 summarise(TB = sum(admitted_TB)) %>% 
 ggplot(aes(x = region, 
            y = TB, 
            fill = region)) + 
        geom_col() +
        scale_fill_viridis(discrete = TRUE) +
        theme_ipsum() +
        ggtitle("Admitted with mention TB") +
        ylab("Number of students")+
        theme(axis.title.x=element_blank(),
              axis.text.x=element_blank(),
              axis.ticks.x=element_blank())

ggplotly(p, tooltip = c("x" ,"y"))

Based on the chart, Ile de France has recorded the highest number of admissions since 2006: 1’914’458 students got admitted in this region so far. The second position was assigned to Auvergne Rhône-Alpes, with a total number of admissions equal to 1’0276’655 students. On the third position, the region Hauts-de-France registered 99’0610 admissions since 2006. Also, Grand Est, Nouvelle-Aquitaine and Occitanic have performed similarly over the years, recording around 85’000 admissions over the period [2006-2021]. On the contrary, Corse has recorded the least number of admissions so far: Only 41’541 students were admitted in this region over the same time interval.

Taking the distinction into consideration, we notice that the regions with the highest number of admissions are the ones with the highest number of all the distinctions. As a matter of fact, Ile de France continues to lead the number of highest honour (TB), high honours (B), honours (AB) and standard pass (without) admissions.

The results are not surprising. Indeed, according to statistics, Ile de France is the most populated region, with an approximative number of 12’262’544 inhabitants. Thus, we expected it to take the lead in terms of the number of admissions.

Period 2006-2021 by region

In order to take the time component into account, a linear chart has been created: We used the group_by function in order to group success_rate by session and region. We relied on this chart to analyze the evolution of the success rate throughout the years for each of the regions. Also, it enabled us to take the admission’s distinction into consideration and thus, analyze the development of students’ performance in all French regions.

Success rate

It is to be noted that for a better visual interpretation the scale of the y-axis is adapted for each graph.


 p <- dnb_results %>% 
 select(success_rate_pct, region, session) %>% 
 group_by(region, session) %>% 
 summarise(success_rate = mean(success_rate_pct, na.rm = TRUE)) %>% 
 ggplot(aes(x = session, 
            y = success_rate, 
            color = region,
            text = region)) + 
        geom_line() +
        scale_color_viridis(discrete = TRUE) +
        theme_ipsum() +
        ggtitle("Success rate") +
        ylab("Rate in %")

ggplotly(p, tooltip = c("text","x" ,"y" ))
Admitted with zero mention

It is to be noted that for a better visual interpretation the scale of the y-axis is adapted for each graph.


 p <- dnb_results %>% 
 select(without_pct, region, session) %>% 
 group_by(region, session) %>% 
 summarise(zero = mean(without_pct, na.rm = TRUE)) %>% 
 ggplot(aes(x = session, 
            y = zero, 
            color = region,
            text = region)) + 
        geom_line() +
        scale_color_viridis(discrete = TRUE) +
        theme_ipsum() +
        ggtitle("Admitted with zero mention") +
        ylab("Rate in %")

ggplotly(p, tooltip = c("text","x" ,"y" ))
Admitted with mention AB

It is to be noted that for a better visual interpretation the scale of the y-axis is adapted for each graph.


 p <- dnb_results %>% 
 select(AB_pct, region, session) %>% 
 group_by(region, session) %>% 
 summarise(AB = mean(AB_pct, na.rm = TRUE)) %>% 
 ggplot(aes(x = session, 
            y = AB, 
            color = region,
            text = region)) + 
        geom_line() +
        scale_color_viridis(discrete = TRUE) +
        theme_ipsum() +
        ggtitle("Admitted with mention AB") +
        ylab("Rate in %")

ggplotly(p, tooltip = c("text","x" ,"y" ))
Admitted with mention B

It is to be noted that for a better visual interpretation the scale of the y-axis is adapted for each graph.


 p <- dnb_results %>% 
 select(B_pct, region, session) %>% 
 group_by(region, session) %>% 
 summarise(B = mean(B_pct, na.rm = TRUE)) %>% 
 ggplot(aes(x = session, 
            y = B, 
            color = region,
            text = region)) + 
        geom_line() +
        scale_color_viridis(discrete = TRUE) +
        theme_ipsum() +
        ggtitle("Admitted with mention B") +
        ylab("Rate in %")

ggplotly(p, tooltip = c("text","x" ,"y" ))
Admitted with mention TB

It is to be noted that for a better visual interpretation the scale of the y-axis is adapted for each graph.


 p <- dnb_results %>% 
 select(TB_pct, region, session) %>% 
 group_by(region, session) %>% 
 summarise(TB = mean(TB_pct, na.rm = TRUE)) %>% 
 ggplot(aes(x = session, 
            y = TB, 
            color = region,
            text = region)) + 
        geom_line() +
        scale_color_viridis(discrete = TRUE) +
        theme_ipsum() +
        ggtitle("Admitted with mention TB") +
        ylab("Rate in %")

ggplotly(p, tooltip = c("text","x" ,"y" ))

The standard pass admissions’ rate follows a negative pattern: It has been decreasing since 2006 for all the regions. It dropped from 57.7 % in 2006 to 24.9% in 2021 for Ile de France, from 49.6% to 24.4% for Nouvelle-Aquitaine and from 53.4 % to 22.6% for Auvergne-Rhône-Alpes.

In 2017 and 2020, the standard pass admission’s rate as well as the rate of admissions with honours dropped remarkably, decreasing suddenly in 2020 for all the regions. However, this tendency was disrupted in 2019, in which this number increased noticeably. In 2021, we noticed a reproduction of 2017’s positive pattern translating thus, a deterioration of students’ performance once again.

Conversely, the number of students succeeding with a great distinction (highest and high honours) has been increasing in all French regions since 2006, reaching its highest point in 2020. Similarly, we notice a strong negative evolution in 2019 that resumed in 2021.

4.1.3 Departmental Analysis

Following our national and regional analysis, we wanted to reduce the scope of our analysis further in order to have a better understanding of our data.

4.1.3.1 Box plot Analysis per department

The first step of this analysis is to visualize the performance of all French departments over the period [2006-2021]. In order to do that, the following box plot has been created. For this, we needed to group our data by department and session. Then, we plotted the success rate of all French departments over the period [2006-2021].


dnb_pct_dep <- dnb_results %>%
  group_by(department, session) %>% 
  summarise(AB_pct_dep = mean(AB_pct, na.rm = TRUE),
            B_pct_dep = mean(B_pct, na.rm = TRUE),
            TB_pct_dep = mean(TB_pct, na.rm = TRUE),
            without_pct_dep = mean(without_pct, na.rm = TRUE),
            success_rate_pct_dep = mean(success_rate_pct, na.rm = TRUE))
success_rate_pct

It is to be noted that for a better visual interpretation the scale of the y-axis is adapted for each graph.


p <- dnb_pct_dep %>% 
  ggplot(aes(x = session, 
             y = success_rate_pct_dep, 
             group = session, 
             fill = session, 
             text = department,
             text2 = success_rate_pct_dep)) +
        geom_boxplot()+
        geom_jitter(width = 0.25, alpha = 0.5)+ 
        scale_fill_gradientn(colors = viridis(16))+
        guides(fill = "none")+
        labs( x= "", y = "Success rate in %",
                title ="Success rate of each Department by session")

ggplotly(p, tooltip = c("text","text2"))
without_pct

It is to be noted that for a better visual interpretation the scale of the y-axis is adapted for each graph.


p <- dnb_pct_dep %>% 
  ggplot(aes(x = session, 
             y = without_pct_dep, 
             group = session, 
             fill = session, 
             text = department,
             text2 = without_pct_dep)) +
        geom_boxplot()+
        geom_jitter(width = 0.25, alpha = 0.5)+ 
        scale_fill_gradientn(colors = viridis(16))+
        guides(fill = "none")+
        labs( x= "", 
              y = "Rate of students with no mention in %",
              title ="Rate of students with no mention of each Department by session")

ggplotly(p, tooltip = c("text","text2"))
B_pct

It is to be noted that for a better visual interpretation the scale of the y-axis is adapted for each graph.


p <- dnb_pct_dep %>% 
  ggplot(aes(x = session, 
             y = B_pct_dep, 
             group = session, 
             fill = session, 
             text = department,
             text2 = B_pct_dep)) +
        geom_boxplot()+
        geom_jitter(width = 0.25, alpha = 0.5)+ 
        scale_fill_gradientn(colors = viridis(16))+
        guides(fill = "none")+
        labs( x= "", 
              y = "Rate of students with mention Bien in %",
              title ="Rate of students with mention Bien of each Department by session")

ggplotly(p, tooltip = c("text","text2"))
AB_pct

It is to be noted that for a better visual interpretation the scale of the y-axis is adapted for each graph.


p <- dnb_pct_dep %>% 
  ggplot(aes(x = session, 
             y = AB_pct_dep, 
             group = session, 
             fill = session, 
             text = department,
             text2 = AB_pct_dep)) +
        geom_boxplot()+
        geom_jitter(width = 0.25, alpha = 0.5)+ 
        scale_fill_gradientn(colors = viridis(16))+
        guides(fill = "none")+
        labs( x= "", 
              y = "Rate of students with mention Assez Bien in %",
              title ="Rate of students with mention Assez Bien of each Department by session")

ggplotly(p, tooltip = c("text","text2"))
TB_pct

It is to be noted that for a better visual interpretation the scale of the y-axis is adapted for each graph.


p <- dnb_pct_dep %>% 
  ggplot(aes(x = session, 
             y = TB_pct_dep, 
             group = session, 
             fill = session, 
             text = department,
             text2 = TB_pct_dep)) +
        geom_boxplot()+
        geom_jitter(width = 0.25, alpha = 0.5)+ 
        scale_fill_gradientn(colors = viridis(16))+
        guides(fill = "none")+
        labs( x= "", 
              y = "Rate of students with mention Très Bien in %",
              title ="Rate of students with mention Très Bien of each Department by session")

ggplotly(p, tooltip = c("text","text2"))

Based on our data analysis, the success rate is following a positive pattern. It has been increasing since 2006. In the same way, we notice an improvement of the students’ success rate in 2017 as well as a remarkable decline in 2019. A boost in the students’ performance was recorded in 2021 in which, the success rate in all departments has reached its highest level since 2006.

Also, the number of students succeeding with honours, or a standard pass distinction has been decreasing since 2006 for all departments. It remarkably decreased in 2017 compared to the previous years, reaching its lowest level in 2020. However, the negative pattern resumed in 2021, in which the number of students graduating with honours or with a standard pass distinction started to increase.

In the same context, the number of students graduating with highest or high honours has been increasing since 2006 for all French departments. It peculiarly heightened in 2017 to reach its highest point in 2020, which reflects a noticeable improvement in the students’ performance.

Moreover, we can see that the distribution remains stable over time for all ditinctions apart for the distinction TB. Indeed, up until 2017, the variance was very small with Q1 and Q3 being distant of around 2 percentage points. From 2017 onwards, it increased to nearly 5 percentage points.

Box plot Analysis of at the Department level

The following box plot was created in order to compare the performance of the best and “worst” department. It allowed us to visualize the difference between the highest achieving department, Paris in 2020 and one of the least achieving one, Eure et Loir. We chose to display the results of each establishment to see the dispersion in the results within one department.

We plotted the different success rates of the Paris department for the year 2006 and 2020 using the filter function. We proceeded the same way to plot the success rates of the Eure et Loir region over the same period.

Paris 2020

Paris <- dnb_results %>%
  select(school_id,establishment_name,department,session, contains("pct")) %>% 
  filter(department == "PARIS", session == "2020") %>% 
  pivot_longer(c(contains("pct")), # pivot longer to allow for a clean and easy boxplot graph with each pct
               names_to = "Mention_type",
               values_to = "Rate")
Paris$Mention_type <- factor(Paris$Mention_type, levels = c("success_rate_pct","without_pct", "AB_pct", "B_pct", "TB_pct")) #creation of factor to order the graph

p <- Paris %>% 
 ggplot(aes(x = Mention_type, 
             y = Rate, 
             fill = Mention_type, 
             text = establishment_name,
             text2 = Rate)) +
        geom_boxplot()+
        geom_jitter(width = 0.25, alpha = 0.5)+
        guides(fill = "none")+
        scale_fill_viridis(discrete = TRUE) +
        theme_ipsum() +
        labs( x= "", 
              y = "Rate in %",
              title ="Results for Parisian establishments in 2020")

ggplotly(p, tooltip = c("text","text2"))
Paris 2006

Paris <- dnb_results %>%
  select(school_id,establishment_name,department,session, contains("pct")) %>% 
  filter(department == "PARIS", session == "2006") %>% 
  pivot_longer(c(contains("pct")), # pivot longer to allow for a clean and easy boxplot graph with each pct
               names_to = "Mention_type",
               values_to = "Rate")
Paris$Mention_type <- factor(Paris$Mention_type, levels = c("success_rate_pct","without_pct", "AB_pct", "B_pct", "TB_pct")) #creation of factor to order the graph

p <- Paris %>% 
 ggplot(aes(x = Mention_type, 
             y = Rate, 
             fill = Mention_type, 
             text = establishment_name,
             text2 = Rate)) +
        geom_boxplot()+
        geom_jitter(width = 0.25, alpha = 0.5)+
        guides(fill = "none")+
        scale_fill_viridis(discrete = TRUE) +
        theme_ipsum() +
        labs( x= "", 
              y = "Rate in %",
              title ="Results for Parisian establishments in 2006")

ggplotly(p, tooltip = c("text","text2"))
Eure et Loir 2020

Eure <- dnb_results %>%
  select(school_id,establishment_name,department,session, contains("pct")) %>% 
  dplyr::filter(department == "EURE-ET-LOIR", session == "2020") %>% 
  pivot_longer(c(contains("pct")), # pivot longer to allow for a clean and easy boxplot graph with each pct
               names_to = "Mention_type",
               values_to = "Rate")
Eure$Mention_type <- factor(Eure$Mention_type, levels = c("success_rate_pct","without_pct", "AB_pct", "B_pct", "TB_pct")) #creation of factor to order the graph

 p <- Eure %>% 
 ggplot(aes(x = Mention_type, 
             y = Rate, 
             fill = Mention_type, 
             text = establishment_name,
             text2 = Rate)) +
        geom_boxplot()+
        geom_jitter(width = 0.25, alpha = 0.5)+
        guides(fill = "none")+
        scale_fill_viridis(discrete = TRUE) +
        theme_ipsum() +
        labs( x= "", 
              y = "Rate in %",
              title ="Results for Eure et Loir establishments in 2020")

ggplotly(p, tooltip = c("text","text2"))
Eure et Loir 2006

Eure <- dnb_results %>%
  select(school_id,establishment_name,department,session, contains("pct")) %>% 
  dplyr::filter(department == "EURE-ET-LOIR", session == "2006") %>% 
  pivot_longer(c(contains("pct")), # pivot longer to allow for a clean and easy boxplot graph with each pct
               names_to = "Mention_type",
               values_to = "Rate")
Eure$Mention_type <- factor(Eure$Mention_type, levels = c("success_rate_pct","without_pct", "AB_pct", "B_pct", "TB_pct")) #creation of factor to order the graph

 p <- Eure %>% 
 ggplot(aes(x = Mention_type, 
             y = Rate, 
             fill = Mention_type, 
             text = establishment_name,
             text2 = Rate)) +
        geom_boxplot()+
        geom_jitter(width = 0.25, alpha = 0.5)+
        guides(fill = "none")+
        scale_fill_viridis(discrete = TRUE) +
        theme_ipsum() +
        labs( x= "", 
              y = "Rate in %",
              title ="Results for Eure et Loir establishments in 2006")

ggplotly(p, tooltip = c("text","text2"))

Paris

On average, the success rate in all Paris establishments reached 93% in 2020. Representing thus, the highest performance recorded since 2006. Accordingly, the number of highest honour admissions outpaced the number of admissions with other distinctions.

Conversely, the success rate was much lower in 2006: It was around [ 40%- 60%], compared with [66%-100%] in 2020. In the same year, the number of students succeeding with a standard pass distinction was the highest, while the number of admissions with highest honour represented the smallest percentage.

Eure et Loire

In 2020, the results of the department of Eure et Loire followed the same pattern as those of the Paris department. It reached its peak in 2020, with an average success rate of 87.90%, compared to only 81% in 2006. Similarly, the number of admissions with highest honour outpaced the number of successes with other distinctions, in 2020.

However, in 2006, the number of students graduating with a standard pass distinction outstripped the one of those graduating with all the other distinctions.

Overall, in 2020, the Paris department outperformed the department of Eure et Loire, with an average success rate of 92.40% compared to only 87.90% for Eure et Loire. In the same context, the number of admissions with highest honour in the department of Paris outstripped largely the one of Eure et Loire. It was around [29.17% -55.21%] in Paris and around [14.52%-33.53%], in the department of Eure et Loire. Also, the number of standard pass admissions was remarkably higher in Eure et Loire. It represented 24.44% of the total number of admissions in 2020, compared to only 12.97% in Paris. Similarly, the department recorded an average rate of admissions with the honours distinction equal to 22.86%, compared to only 17.59% in Paris. We see that the dispersion pattern of the establishment is rather similar between the departments but changes between 2006 and 2020.

4.2 Establishment_24

After analyzing the data provided by the dnb-result data set and understanding the evolution of French secondary students’ performance during the years [2006-2021], as well as the most significant events that marked this period. It is time to explore the Establishment_24 data set.

We started our analysis by mapping the establishments labelled Generation 2024 across France using the ggplot’s functions, geo_polygon combined with geom_point. This helped us visualize the geographic distribution of the labelled establishments and understand where they are mostly located.


p <- ggplot() +
  geom_polygon(data = map, aes(long,lat, group = group), fill = "white", color = "grey") +
  geom_point(data = establishment_24, aes(x = longitude , y = latitude, text = department_fr, text2 = establishment),  size = 0.5)+
  coord_map() +
  labs(x = "", 
       y = "", 
       title = 'Establishment labelled "Génération 2024') +
  map_theme

ggplotly(p, tooltip = c("text", "text2"))

Based on the map, we conclude that the establishments labelled generation 2024 are mostly located in the region of Île-de-France, Auvergne-Rhône-Alpes, Occitanie and Nouvelle Aquitaine.However, we can hardly conclude anything with such graph and data. We decided to use the summary of labelled establishment by department for the rest of this exploration.

New establishment 2024 labellization per department and year

After having a general idea about the geographic distribution of generation 2024’s establishments, we created this second map to analyze their evolution in the period [2017-2022], for each department. For this, we used the function filter and ggplot to map for each year, starting from 2017, the number of new labelled establishment, and this for all French departments.

establishment_24_dep_map <- left_join(x = map[,-6], y = establishment_24_dep, by = "department_fr")
2017


p <- establishment_24_dep_map  %>% 
  filter(session_started == 2017) %>%
  ggplot(aes(x = long, 
             y = lat, 
             group = group, 
             text = department_fr)) +
  geom_polygon(data = map, aes(long,lat, group = group), fill = "white", color = "grey") + # here to add the outline of all departments
  geom_polygon(aes(fill= establishment), color = "black") +
  scale_fill_viridis(name = "Number of establishment", limits = c(0, 50)) +
  labs(x = "", 
       y = "", 
       title = " New establishment labelled Generation 2024 in 2017")+
  map_theme


ggplotly(p, tooltip = c("text","fill"))
2018


p <-establishment_24_dep_map  %>% 
  filter(session_started == 2018) %>%
  ggplot(aes(x = long, 
             y = lat, 
             group = group, 
             text = department_fr)) +
  geom_polygon(data = map, aes(long,lat, group = group), fill = "white", color = "grey") + # here to add the outline of all departments
  geom_polygon(aes(fill= establishment), color = "black") +
  scale_fill_viridis(name = "Number of establishment", limits = c(0, 50)) +
  labs(x = "", 
       y = "", 
       title = " New establishment labelled Generation 2024 in 2018")+
  map_theme

ggplotly(p, tooltip = c("text","fill"))
2019


p <-establishment_24_dep_map  %>% 
  filter(session_started == 2019) %>%
  ggplot(aes(x = long, 
             y = lat, 
             group = group, 
             text = department_fr)) +
  geom_polygon(data = map, aes(long,lat, group = group), fill = "white", color = "grey") + # here to add the outline of all departments
  geom_polygon(aes(fill= establishment), color = "black") +
  scale_fill_viridis(name = "Number of establishment", limits = c(0, 50)) +
  labs(x = "", 
       y = "", 
       title = "New establishment labelled Generation 2024 in 2019")+
  map_theme

ggplotly(p, tooltip = c("text","fill"))
2020


p <-establishment_24_dep_map  %>% 
  filter(session_started == 2020) %>%
  ggplot(aes(x = long, 
             y = lat, 
             group = group, 
             text = department_fr)) +
  geom_polygon(data = map, aes(long,lat, group = group), fill = "white", color = "grey") + # here to add the outline of all departments
  geom_polygon(aes(fill= establishment), color = "black") +
  scale_fill_viridis(name = "Number of establishment", limits = c(0, 50)) +
  labs(x = "", 
       y = "", 
       title = "New establishment labelled Generation 2024 in 2020")+
  map_theme

ggplotly(p, tooltip = c("text","fill"))
2021


p <-establishment_24_dep_map  %>% 
  filter(session_started == 2021) %>%
  ggplot(aes(x = long, 
             y = lat, 
             group = group, 
             text = department_fr)) +
  geom_polygon(data = map, aes(long,lat, group = group), fill = "white", color = "grey") + # here to add the outline of all departments
  geom_polygon(aes(fill= establishment), color = "black") +
  scale_fill_viridis(name = "Number of establishment", limits = c(0, 50)) +
  labs(x = "", 
       y = "", 
       title = "New establishment labelled Generation 2024 in 2021")+
  map_theme

ggplotly(p, tooltip = c("text","fill"))
2022


p <-establishment_24_dep_map  %>% 
  filter(session_started == 2022) %>%
  ggplot(aes(x = long, 
             y = lat, 
             group = group, 
             text = department_fr)) +
  geom_polygon(data = map, aes(long,lat, group = group), fill = "white", color = "grey") + # here to add the outline of all departments
  geom_polygon(aes(fill= establishment), color = "black") +
  scale_fill_viridis(name = "Number of establishment", limits = c(0, 50)) +
  labs(x = "", 
       y = "", 
       title = "New establishment labelled Generation 2024 in 2022")+
  map_theme

ggplotly(p, tooltip = c("text","fill"))

At first glance, we notice the positive evolution of the number of labelled establishments. In 2017, approximatively, twenty labeled establishments were recorded within the French departments, with four registered by the Nord, being the highest number reported that year. This tendency continued until 2020, when almost all departments recorded new labeled establishments.

However, this pattern was disrupted in 2021, when only 15 departments recorded an evolution of the number of their labeled establishments. Surprisingly, 2022 has been the peak year so far: Only four departments did not declare an increase in this number.

At this point of our analysis, we wanted to analyze the percentage of labeled establishments across all French departments. To do that, we computed the rate of 2024 establishments for each department. We first counted the number of establishment by department using the establishments present in the data set dnb_results. From this, we just had to measure the rate of labelled establishment. Next, we used the ggplot function to map our data.

# We first need to count the number of establishment per department
nb_est <- dnb_results %>% 
  select(school_id, department_fr) %>% 
  group_by(department_fr) %>% 
  summarise(n_distinct(school_id))
# We include the values with the one from establishment_24 
establishment_24_dnb_dep <- left_join(x = nb_est, y = establishment_24_dep)

# We measure the rate of labelled establishment
establishment_24_dnb_dep <- establishment_24_dnb_dep %>% 
  group_by(department_fr) %>% 
  summarise(establishment = sum(establishment),
            nb_est = mean(`n_distinct(school_id)`)) %>% #use mean to keep the value of n_distinct(schoold_id) 
  mutate(rate = establishment/nb_est*100)

establishment_24_dnb_dep_map <- left_join(x = map[,-6], y = establishment_24_dnb_dep, by = "department_fr")

p <-establishment_24_dnb_dep_map  %>% 
  ggplot(aes(x = long, 
             y = lat, 
             group = group, 
             text = department_fr)) +
  geom_polygon(aes(fill= rate), color = "black") +
  scale_fill_viridis(name = "Rate of labelled establishment in % ") +
  labs(x = "", 
       y = "", 
       title = "Establishment 2024 in 2022")+
  map_theme

ggplotly(p, tooltip = c("text","fill"))

Based on the map, the department of Hautes-Pyrenees has the highest number of labeled establishments so far, with a rate of 52.78% in 2022. Haute-Corse took the second position, having 50% of its establishments labelled by the end of 2022. On the third position, 43.14% of the establishmentsof of Val-Doise were labeled by the end of 2022. Other departments have also shown interesting rates such as Essonne (41.43%), Corse du Sud (40%), Tarn-et-Garonne (40%).

It is interesting to see that even if the Olympic games will take place in Paris the establishments from the Ile de france region have relatively low labeling rates. Indeed apart from Val-Doise and Essone which have high labeling rates, the other departments are rather around 20%. The explanation could be related to the number of establishments in these departments. For example, Paris has 216 establishments whereas Val-Doise and Essone have respectively 153 and 140 establishments. Similarly the two Corsican departments and Hautes-Pyrenees only have 15, 16 and 36 establishments. To conclude it explains why some departments may have higher rates but based on the principle that establishments of a same department have a smimilar number of students and that we will compare this data to success rate by department it should not affect our end results.

4.3 student_housing

After analyzing French secondary students’ success rate on a national, regional and a departmental level, and exploring the establishment_24 dataset, it is time to focus on the students housing dataset. Our goal is to analyze the evolution of the different housing modes between 2020 and 2022.

The following graph plots the number of students and groups them by their Housing option.

p <- student_housing %>% 
  select(session, external_students_secondary_education, half_boarders_students_secondary_education, boarding_students_secondary_education) %>% 
  group_by(session) %>% 
  summarise(external_students_secondary_education = sum(external_students_secondary_education, na.rm = TRUE),
            half_boarders_students_secondary_education = sum(half_boarders_students_secondary_education, na.rm = TRUE), 
            boarding_students_secondary_education = sum(boarding_students_secondary_education, na.rm = TRUE)) %>% 
  gather("Housing_option", "students", 2:4) %>%  
  ggplot(aes(x=session, 
             y=students,
             fill=Housing_option)) +
    geom_bar(stat="identity", position=position_dodge())+
    ggtitle("Students per housing option in France")+
    xlab("Session")+
    ylab("Number of students")+
    scale_fill_viridis(discrete = TRUE)+
    theme_ipsum()

ggplotly(p, tooltip = "y")

Based on the chart, half boarding students represents the largest category, recording 3’654’401 students in 2020, compared to 1’536’776 external students, and only 191’100 boarding students in the same year. This category remains the undisputed leader in 2021 and 2022. The students’ distribution has not changed remarkably in the period [2020-2022]. However, we notice a slight decrease in the number of half-boarders ’students. French secondary students are less and less interested in the half-boarding housing mode. They were 3’654’401 students in 2020, compared to 3’593’806 in 2022, representing a negative variation of 1.65%. Conversely, the external housing mode is attracting more and more students. The number of externals showed a positive change of 4.94% in 2022 compared to 2020. It increased from 1’536’776 students in 2019 to 1’612’770 in 2022.

Housing option rate in secondary education per department

To conduct a departmental analysis, we created the following graph, in which we plotted for each of the housing modes, the rate of students recorded in each specific category for all French departments.

external students

It is to be noted that for a better visual interpretation the scale of the y-axis is adapted for each graph.


p <- housing_dep %>% 
  ggplot(aes(x = as.factor(session), 
             y = external_students_rate, 
             group = session, 
             fill = session, 
             text = department_fr,
             text2 = external_students_rate)) +
  geom_boxplot()+
  geom_jitter(width = 0.25, alpha = 0.5)+ 
  scale_fill_viridis()+
  guides(fill = "none")+
  labs( x= "",
        y = "Rate of external students in %",
        title ="Rate of external students of each Department by session")+
  theme_ipsum()

ggplotly(p, tooltip = c("text","text2"))
half-boarders students

It is to be noted that for a better visual interpretation the scale of the y-axis is adapted for each graph.


p <- housing_dep %>% 
  ggplot(aes(x = as.factor(session), 
             y = half_boarders_students_rate, 
             group = session, 
             fill = session, 
             text = department_fr,
             text2 = half_boarders_students_rate)) +
  geom_boxplot()+
  geom_jitter(width = 0.25, alpha = 0.5)+ 
  scale_fill_viridis()+
  guides(fill = "none")+
  labs( x= "",
        y = "Rate of half-boarders students in %",
        title ="Rate of half-boarders  students of each Department by session")+
  theme_ipsum()

ggplotly(p, tooltip = c("text","text2"))
boarding students

It is to be noted that for a better visual interpretation the scale of the y-axis is adapted for each graph.


p <- housing_dep %>% 
  ggplot(aes(x = as.factor(session), 
             y = boarding_students_rate, 
             group = session, 
             fill = session, 
             text = department_fr,
             text2 = boarding_students_rate)) +
  geom_boxplot()+
  geom_jitter(width = 0.25, alpha = 0.5)+ 
  scale_fill_viridis()+
  guides(fill = "none")+
  labs( x= "",
        y = "Rate of boarding students in %",
        title ="Rate of boarding students of each Department by session")+
  theme_ipsum()

ggplotly(p, tooltip = c("text","text2"))

Half boarding housing mode attracts the higher number of students in all French departments. On average, 69 % of students opted for this residence mode in 2020, 69.16% in 2021 and 68.88% in 2022. In 2020, 83.43% of Cotes_Darmor’ students were half boarders, which is the highest rate recorded in all French departments. Contrarily, only 49.7% of Seine-Saint-Denis’ students chose this housing mode, which is the lowest rate recorded that year. In 2021 and 2022, we notice the same distribution of half boarders’ students among French departments, with the highest rate registered by Cotes_Darmor and Vendee, and the lowest recorded with Seine-Saint-Denis. External housing mode is the second common one after the half boarding mode. In 2020, external students represented 24.71% of the total number of students, compared to 24.61% and 24.70% in 2021 and 2022, respectively. Landes, Cotes_Darmor and Vendee are the departments with the least number of external students, with a respective rate of 8.01 %, 8.42% and 8.49%. Seine-Saint-Denis occupies the first position, recording the highest number of externals: 50%. Similarly, in 2021 and 2022, those three departments recorded the lowest rates while the department of Seine-Saint-Denis recorded the highest. Finally, the boarding housing mode attracts the least number of students. It recorded the lowest rates over our study period. On average, boarding students represented 5.69% of the total number of students in 2020, 5.35 % in 2021 and 5.36 % in 2022. In 2020, Paris and Seine-Saint-Denis recorded the lowest rate of boarding students, 0.260% and 0.287 % respectively. Lozere in the other hand, recorded 25.79%, which is the highest rate of all the departments. We notice the same tendency in 2021 and 2022.

4.4 single_parent

Our last data set related to socioeconomical factor is the single parent one, The first visualisation we wanted to do was a barplot that would show the evolution of the total number of single-parent families in France from 2007 to 2008. To do this, we had to isolate, the sing_par and session variables, then summarized the number of single-parents with “sum”. We then were able to use geom_bar.

sp1_ <- single_parent %>%
  select(c("session", "sing_par")) %>%
  group_by(session)%>%
  summarise (sing_par = sum(sing_par, na.rm = TRUE))


p <- ggplot(data = sp1_, aes(x = as.factor(session), 
                             y = sing_par, 
                             fill = as.factor (session),
                             text = sing_par))+
  geom_col(stat = "identity" )+
  scale_fill_viridis(discrete = TRUE, "Session") +
  labs(x = "", y = "Number of single parent families")+
  theme_ipsum()

ggplotly(p, tooltip = "text")

Based on the barplot, we can make an overall analysis of the evolution of single-parent families in France over the years. Although we are missing several years, we can clearly see that the number of single-parent families has continued to increase since 2007. In fact, single-parent families numbered 2’427’110 in 2007, whereas in 2018, they were numbering 3’031’823. In just over 10 years, there has been a 20% increase.

Single-parent families by departments, 2007

We needed to create a data set to create maps.For this, we joined the single_parent and map data sets using department_fr.

jmap_sp<- left_join(x = map[,-6], y = single_parent, by = "department_fr")
p <-jmap_sp %>% 
  filter(session == 2007) %>%
  ggplot(aes(x = long, 
             y = lat, 
             group = group, 
             text = department_fr)) +
  geom_polygon(aes(fill= sing_par), color = "black") +
  scale_fill_viridis(name = "Number of single parent families") +
  labs(x = "", 
       y = "", 
       title = "Single parent families in 2007")+
  map_theme

ggplotly(p, tooltip = c("text","fill"))

From the map, we realise that the departments with the most single-parent families are those with the most inhabitants (Bien dans ma ville, 2022). This is an unsurprising result which influence our choice to create variables for the data to become easily comparable.

To do this we first had to join single_parents and dnb_results_dep by department_fr and session to have the number of students per department. From this, we used the mutate function to create a new variable that divides sing_par by the number of students admitted by department. Thanks to this, we were able to create maps of France showing a proportion of the number of single-parent families per student admitted from 2007 to 2018. To create the interactive maps with nice colours, we used the function ggplotly.

sing_dnb <- left_join(x = single_parent, y = dnb_results_dep, by = c("department_fr", "session"))

sing_dnb <- sing_dnb %>% 
  mutate(single_parent_per_student_admitted = sing_par/admitted)
#Join with the map data set for the mapping. 
sing_dnb_map <- left_join(x = map[,-6], y = sing_dnb, by = "department_fr")

Single parent families per student admitted

2007


p <-sing_dnb_map %>% 
  filter(session == 2007) %>%
  ggplot(aes(x = long, 
             y = lat, 
             group = group, 
             text = department_fr)) +
  geom_polygon(aes(fill= single_parent_per_student_admitted), color = "black") +
  scale_fill_viridis(name = "Single parent families per student", limits = c(0, 8)) +
  labs(x = "", 
       y = "", 
       title = "Single parent families per student in 2007")+
  map_theme

ggplotly(p, tooltip = c("text","fill"))
2008


p <-sing_dnb_map %>% 
  filter(session == 2008) %>%
  ggplot(aes(x = long, 
             y = lat, 
             group = group, 
             text = department_fr)) +
  geom_polygon(aes(fill= single_parent_per_student_admitted), color = "black") +
  scale_fill_viridis(name = "Single parent families per student", limits = c(0, 8)) +
  labs(x = "", 
       y = "", 
       title = "Single parent families per student in 2008")+
  map_theme

ggplotly(p, tooltip = c("text","fill"))
2012


p <-sing_dnb_map %>% 
  filter(session == 2012) %>%
  ggplot(aes(x = long, 
             y = lat, 
             group = group, 
             text = department_fr)) +
  geom_polygon(aes(fill= single_parent_per_student_admitted), color = "black") +
  scale_fill_viridis(name = "Single parent families per student", limits = c(0, 8)) +
  labs(x = "", 
       y = "", 
       title = "Single parent families per student in 2012")+
  map_theme

ggplotly(p, tooltip = c("text","fill"))
2013


p <-sing_dnb_map %>% 
  filter(session == 2013) %>%
  ggplot(aes(x = long, 
             y = lat, 
             group = group, 
             text = department_fr)) +
  geom_polygon(aes(fill= single_parent_per_student_admitted), color = "black") +
  scale_fill_viridis(name = "Single parent families per student", limits = c(0, 8)) +
  labs(x = "", 
       y = "", 
       title = "Single parent families per student in 2013")+
  map_theme

ggplotly(p, tooltip = c("text","fill"))
2017


p <-sing_dnb_map %>% 
  filter(session == 2017) %>%
  ggplot(aes(x = long, 
             y = lat, 
             group = group, 
             text = department_fr)) +
  geom_polygon(aes(fill= single_parent_per_student_admitted), color = "black") +
  scale_fill_viridis(name = "Single parent families per student", limits = c(0, 8)) +
  labs(x = "", 
       y = "", 
       title = "Single parent families per student in 2017")+
  map_theme

ggplotly(p, tooltip = c("text","fill"))
2018


p <-sing_dnb_map %>% 
  filter(session == 2018) %>%
  ggplot(aes(x = long, 
             y = lat, 
             group = group, 
             text = department_fr)) +
  geom_polygon(aes(fill= single_parent_per_student_admitted), color = "black") +
  scale_fill_viridis(name = "Single parent families per student", limits = c(0, 8)) +
  labs(x = "", 
       y = "", 
       title = "Single parent families per student in 2018")+
  map_theme

ggplotly(p, tooltip = c("text","fill"))

It can be noted that the departments in the East have the lowest single parent rate (Mayenne, Maine-et-Loire, Vendée), while the departments in the Iles-de-France region (Paris, Seine- Saint-Denis, Marne), those around the Mediterranean Sea (Bouche-du-Rhone, Pyrenees, Herault) and those in Corsica (Haute-Corse, Corse du Sud) have the highest rates. It will therefore be important to keep an eye on these departments when analyzing the results of Diploma Brevet National (DNB).

4.5 covid_in_schools

First, we wanted to get an overview of COVID positive cases in France over the covid period. We selected from covid_in_school the variables positive and test_date, group by test_date and summarised positive with sum. We then use ggplot with test_date as the x-axis and positive for the y-axis.

p <- covid_in_schools %>% 
  select(positive, test_date) %>% 
  group_by(test_date) %>% 
  summarise(positive = sum(positive)) %>% 
  ggplot( mapping = aes(x= test_date, y = positive)) +
  geom_line() +
  labs(title = "French  Covid-19 cases for the age group 11 to 15 years old  (2020-2022)", x = "Date", y = "Number of cases")+
  theme_ipsum()

ggplotly(p, tooltip = c("x","y"))

From 2020 until the end of 2021, we notice waves of covid cases. Indeed, every 6 months, the number of positive cases increases to nearly 20’000 positive cases per day and then drops to around 1’000 positive cases per day. However, we observe a clear increase, with a peak of more than 235,000 positive cases per day, in January 2022. After this increase, the number of positive cases fell significantly to remain around 2’000-10’000 positive cases per day. The 2022 academic year is therefore a year to keep an eye on, for further researches as we do not yet have the dnb results of this session.

For the second exploratory analysis, we wanted to compare the positive covid cases by regions. As population differs between departemnts, it was easier to compare using incidence_rate. To create the map, we selected from the covid_in_schools dataset the variables department_code, incidence_rate, session, region, department_fr and test_date, did a group_by region and test_date and summarised positive by doing an average.

p <- covid_in_schools %>%
  select(c("department_code", "incidence_rate", "session", "region", "department_fr", "test_date", "positive")) %>%
  group_by(region, test_date) %>%
  summarise(incidence_rate = mean(incidence_rate, na.rm = TRUE), 
            positive = sum(positive)) %>% 
  ggplot() +
  geom_line(mapping = aes(x = test_date, y = incidence_rate, color = region))+
   scale_color_viridis(discrete = TRUE) +
    labs(title = "Covid-19 incidence rate for the age group 11 to 15 years old by region (2020-2022)", x = "Date", y = "Incidence rate") +
    theme_ipsum()

ggplotly(p, tooltip = c("x","y", "color"))

For the age group 11 to 15, Ile-de-France and Auvergne-Rhone-Alpes seem to be the two regions which were the most affected in the first waves. The Covid19 did not spare any department as we their incidence rate closely follow each others. The only slight exception is the wave in March-April 2022 where a few departments like Corse and Bretagne were more affected.

Incidence rate average per session

The last vizualization for the covid dataset is a mapping of incidence rate average by departments from session 2020 to 2023.

covidpos_dep <- covid_in_schools %>% 
  select(c("department_code", "incidence_rate", "session", "department_fr" )) %>% 
  group_by(department_fr, session) %>%
  summarise(incidence_rate = mean(incidence_rate, na.rm = TRUE))

covidpos_dep <- left_join(x = map[,-6], y = covidpos_dep)
2020

It is to be noted that for a better visual interpretation the scale is adapted for each graph.


p <- covidpos_dep %>% 
  filter(session == 2020) %>% 
  ggplot( aes(x= long, y= lat, group=group, text = department_fr)) +
  geom_polygon(aes(fill= incidence_rate), color = "black") +
  coord_map()+
  scale_fill_viridis(name = "Incidence rate average in 2020")+
  map_theme

ggplotly(p)
2021

It is to be noted that for a better visual interpretation the scale is adapted for each graph.


p <- covidpos_dep %>% 
  filter(session == 2021) %>% 
  ggplot( aes(x= long, y= lat, group=group, text = department_fr)) +
  geom_polygon(aes(fill= incidence_rate), color = "black") +
  coord_map()+
  scale_fill_viridis(name = "Incidence rate average in 2021")+
  map_theme

ggplotly(p)
2022

It is to be noted that for a better visual interpretation the scale is adapted for each graph.


p <- covidpos_dep %>% 
  filter(session == 2022) %>% 
  ggplot( aes(x= long, y= lat, group=group, text = department_fr)) +
  geom_polygon(aes(fill= incidence_rate), color = "black") +
  coord_map()+
  scale_fill_viridis(name = "Incidence rate average in 2022")+
  map_theme

ggplotly(p)
2023

It is to be noted that for a better visual interpretation the scale is adapted for each graph.


p <- covidpos_dep %>% 
  filter(session == 2023) %>% 
  ggplot( aes(x= long, y= lat, group=group, text = department_fr)) +
  geom_polygon(aes(fill= incidence_rate), color = "black") +
  coord_map()+
  scale_fill_viridis(name = "Incidence rate average in 2023")+
  map_theme

ggplotly(p)

During the 2020 session, the average incidence rate is relatively low in all departments of France, except in Mayenne where the incidence rate is 17.96, which is the highest rate. For the 2021 session, the scale indicates a clear increase in the proportions of average incidence rate in each of the departments, ranging from 246.9 to 52.6 cases per department. The western part of France is much more affected than the eastern part. We can hypothesise that this increase may be due to the fact that these are the departments closest to the borders and other countries. During the 2022 session, we can observe an overall increase in the incidence rate in each of the departments. Aveyron, Landes, Pyrénées-Atlantiques and Corse du Sud are the departments with the highest rates, with a maximum of 1201. For the 2023 session, the incidence rate fell significantly, particularly in Ile de France. This is because the school year is not over yet. La Creuse is the department with the highest rate with 420.

5 Analysis

In the first part of our project, we focused on exploring our data. We created several charts; bar plots, box plots, maps etc.; to have a general comprehension of our data and retrieve the general tendencies.

This part of the project will be more focused on finding answers in order to draw conclusions about the evolution of the French secondary students’ performance and identifying the variables that have impacted it the most.

5.1 What is the evolution of student performance over time and across the different regions/departments of France?

We have seen previsouly that French secondary students’ performance is showing a positive evolution. The number of admissions has been increasing since 2006, it is 23.15% higher in 2020 compared to 2006. Not only the overall success rate, but also the distinctions that students get have changed between 2006 and 2021. During this period, more and more students passed with a TB distinction (23.98% in 2021), compared to only 4.49 % in 2006. In the same way, the rate of students passing with no mention decreased remarkably. It was 53.85% in 2011, compared to only 25.11% in 2021.


p <- France_results %>% 
  ggplot(aes(x = session, y = Rate, group = Mention_type, color = Mention_type))+
  geom_line()+
  scale_color_viridis(discrete = TRUE) +
    ggtitle("National DNB statistics") +
    theme_ipsum() +
    ylab("Rate in %")

ggplotly(p, tooltip = c("x" ,"y"))

2017: A drastic improvement in students’ performance was recorded in 2017.

Students’ performance has drastically improved in 2017, recording a success rate of 88.74% for the first time since 2006. Also, the number of admissions with a TB distinction expanded during that year, reaching 25.12% compared to only 4.49% in 2006. In the same context, admissions without a mention reached their lowest point so far, with a rate of 23.40%.

Based on further literature research, we discovered that this major shift in the French secondary students’ performance has to do with the reform “learn better to succeed better”, that was brought into force at the beginning of the 2016 school year (accounted for session 2017 in our analysis) by the minister of education Najat Vallaud-Belkacem. This reform is articulated around two main points: redefining the common core of knowledge and skills as well as the content of the programs to make them more comprehensible, more progressive, and more coherent. The focus was mainly on these three areas: French, mathematics, and history. Moreover, teachers had to devote 20% of the class time to new pedagogical methods such as the practice of multidisciplinary teaching, providing personalized support, and engaging students in collaborative approaches.

2019: Sudden decrease in students’ performance

Students’ performance deteriorated sharply in 2019. The rate of admissions with a TB distinction dropped remarkably, passing from 25.12% in 2017, to 19.66% in 2019. Accordingly, the number of admissions without a mention increased, reaching 30.63% compared to 23.40% in 2017.

2020: students’ performance reached its peak

In 2020, students’ performance has reached its peak. 90.19% of French secondary school students succeeded that year, compared to only 86.14% in 2019. On one hand, the number of admissions with a TB mention improved drastically compared to 2019, with a rate of 28.45%. On the other hand, the admissions without a mention registered a rate of 21.07%, being the lowest rate recorded since 2006.

Based on our research, we assume that this unique performance has to do with the decision of exams cancellation that was taken by the education minister Jean-Michel Blanquer in 2020, due to the corona virus pandemic: France canceled traditional end of secondary school exams in favor of continuous assessment. So, instead of sit-down tests in the summer, students received an average score, for each subject, based on the grades scored in tests and homework throughout the academic year.

5.2 Do socio-economic factors such as the type of accommodation, family situation or college policies have an influence on student success ?

5.2.1 establishment_24

We first mapped the establishment labelled Generation 2024 onto a map of the average success rate for the brevet during the period 2006-20021. This visualisation helped us determine whether a clear link could be made or if further analysis need to be performed. To create the map, we first had to calculate the average success rate and join the result with the map data set.


result <- dnb_results %>% 
  select(department_fr, success_rate_pct) %>% 
  group_by(department_fr) %>% 
  summarise(success_rate = mean(success_rate_pct, na.rm = TRUE))


result_map <- left_join(x = map[,-6], y = result)

Then, with the help of ggplot, we have superimposed the map of the average success rate and the dotted map of the localisation of the labelled establishment presented in chapter 4.2.


p <-   ggplot() +
  geom_polygon(data = result_map, aes(long,lat, 
                            group = group, 
                            fill = success_rate, 
                            text = department_fr)) +
  geom_point(data = establishment_24, aes(x = longitude, y = latitude, text2= establishment), size = 0.5)+
  coord_map() +
  scale_fill_viridis(name = "Average sucess rate, 2006-2021")+
  labs(x = "", 
       y = "", 
       title = "Labelled Establishemnt vs average success rate") +
  map_theme
  
ggplotly(p, tooltip = c("text2", "text") ) #due to a strange behavior of ggplotly we must add the "text" variable in the tooltip otherwise the success rate map is only partially displayed. 

Zooming on the map, one can see that there is no clear pattern of correlation between high achieving departments and great number of labelled establishment. Indeed, one can see that for example, there are many labelled establishment around Paris and the success rate varies greatly. Building from this, we do not expect to have a strong relation between the success rate and the labelling or not of an establishment.

To test this statement, we did a linear regression at the establishment level. We used the Generation 2024 label as a boolean with 1 for labelled establishment and 0 for unlabelled ones. To create this variable we first had to add the school_id and session_started variables from establishment_24 to dnb_results. We used left_join by the variable school_id to keep all the establishments present in dnb_results. To add the variable est_24, we used the mutate coupled with the case_when to cover all cases. est_24 takes the value 1 when the session is greater or equal than session_started and the value 0 for all other cases which are when session is smaller than session_started or the establishment is not labelled and the session_startedvalue is NA.

est_24_join <- establishment_24 %>% 
  select(school_id, session_started)

est_24_dnb <- left_join( x = dnb_results, y = est_24_join, by = c("school_id"))

est_24_dnb <- est_24_dnb %>% 
  filter(session >= 2017) %>% 
  mutate(est_24 = case_when(session >= session_started ~ 1,
                            session <  session_started ~ 0, 
                            is.na(session_started) == TRUE ~ 0 ))

To have a visual representation of the possible relation between the success rate and the labelling, we used geom_bin_2d with est_24 as factor for the x-axis and success_rate_pctfor the y-axis. We used this way of plotting to display in a more visual way the dispersion of the establishments on the y-axis. To do the linear regression we used the lm function.

p <- ggplot(data = est_24_dnb )+
geom_bin_2d( aes( x = as.factor(est_24), y = success_rate_pct))+
  labs(x = "est_24")+
  theme_ipsum()+
  scale_fill_viridis("Number of establishment")

ggplotly(p, tooltip = c("y", "fill"))
lm_est_24_dnb <- lm(data = est_24_dnb, success_rate_pct ~ est_24)
tab_model(lm_est_24_dnb)
  success_rate_pct
Predictors Estimates CI p
(Intercept) 88.09 88.00 – 88.19 <0.001
est 24 -0.60 -1.00 – -0.19 0.004
Observations 42472
R2 / R2 adjusted 0.000 / 0.000

The result of this linear regression shows that there could be a potential negative aspect to have the Generation 2024 label. The R squared being below 0.000 makes this analysis highly unreliable as it would explain less than 0.01% of the variation in success rate.

We decided to continue our analysis and proceeded to perform a linear regression with each distinction. To performed them we followed the same steps as for the success rate regression analysis.

Linear regression for each mention

Without
p <- ggplot(data = est_24_dnb )+
geom_bin_2d( aes( x = as.factor(est_24), y = without_pct))+
  labs(x = "est_24")+
  theme_ipsum()+
  scale_fill_viridis("Number of establishment")
ggplotly(p, tooltip = c("y", "fill"))
lm_est_24_dnb <- lm(data = est_24_dnb, without_pct ~ est_24)
tab_model(lm_est_24_dnb)
  without_pct
Predictors Estimates CI p
(Intercept) 25.44 25.31 – 25.57 <0.001
est 24 -1.21 -1.77 – -0.65 <0.001
Observations 42458
R2 / R2 adjusted 0.000 / 0.000
AB
p <- ggplot(data = est_24_dnb )+
geom_bin_2d( aes( x = as.factor(est_24), y = AB_pct))+
  labs(x = "est_24")+
  theme_ipsum()+
  scale_fill_viridis("Number of establishment")

ggplotly(p, tooltip = c("y", "fill"))
lm_est_24_dnb <- lm(data = est_24_dnb, AB_pct ~ est_24)
tab_model(lm_est_24_dnb)
  AB_pct
Predictors Estimates CI p
(Intercept) 25.68 25.58 – 25.77 <0.001
est 24 -2.30 -2.70 – -1.89 <0.001
Observations 42458
R2 / R2 adjusted 0.003 / 0.003
B
p <- ggplot(data = est_24_dnb )+
geom_bin_2d( aes( x = as.factor(est_24), y = B_pct))+
  labs(x = "est_24")+
  theme_ipsum()+
  scale_fill_viridis("Number of establishment")

ggplotly(p, tooltip = c("y", "fill"))
lm_est_24_dnb <- lm(data = est_24_dnb, B_pct ~ est_24)
tab_model(lm_est_24_dnb)
  B_pct
Predictors Estimates CI p
(Intercept) 25.27 25.18 – 25.36 <0.001
est 24 -0.43 -0.81 – -0.05 0.027
Observations 42458
R2 / R2 adjusted 0.000 / 0.000
TB
p <- ggplot(data = est_24_dnb )+
geom_bin_2d( aes( x = as.factor(est_24), y = TB_pct))+
  labs(x = "est_24")+
  theme_ipsum()+
  scale_fill_viridis("Number of establishment")

ggplotly(p, tooltip = c("y", "fill"))
lm_est_24_dnb <- lm(data = est_24_dnb, TB_pct ~ est_24)
tab_model(lm_est_24_dnb)
  TB_pct
Predictors Estimates CI p
(Intercept) 23.61 23.47 – 23.75 <0.001
est 24 3.94 3.34 – 4.53 <0.001
Observations 42458
R2 / R2 adjusted 0.004 / 0.004

The R squared is unsurprisingly very low again meaning that we can not conclude any strong relation between students achievements and the Generation 2024 label. It is still to be mentioned that it affects positively the TB attribution rate and negatively the rest.

We can conclude that from our analysis we do not see any pattern that the Generation 2024 label influences has any influence on the results of the establishments. It is not that surprising as the objectives of the label is truly to promote the olympics in Paris in 2024 and link the schoold world to the sport’s one. It has no requirements on a amount of sport done during the school time. The total amount of sport practised by students is therefore probably not much greater in a labelled institution than in a non-labelled one, hence, the little influence.

5.2.2 student_housing

To analyse the impact of the housing offerings of the establishments on the results of their student, we first need to join the student_housing data set and the dnb_results one. We filtered out the unnecessary or repetitive variables of dnb_results such as the academy name and code or the education sector. We used inner_join by session, school_id and department_fr to keep only establishment appearing in student_housing.

dnb_prep_housing <- dnb_results %>% 
  select(session, school_id, registered:TB_pct)
housing_dnb <- inner_join(x = student_housing , y = dnb_prep_housing, by = c("session", "school_id", "department_fr"))

It is to be noted that the number of students between students_secondary_education and registered does not match as students of the four years of college are accounted for in students_secondary_education.

5.2.2.1 Linear regression

We evaluated the possibility of doing a multiple linear regression by measuring the correlation of the external_students_rate, half_boarders_students_rate and boarding_students_rate variables.


corrplot(cor(housing_dnb[c(17:19)]), col = viridis(256)) 

The correlation is unsurprisingly very high as if for example one establishment offers a meal for dinner most of the students will take it and very few will eat at home. Therefore, we will use single linear regression to asses whether there exist a link between student results and their habitual place of eating and living.

Linear regression for each offering

We simply used the lm function for each offering. We did not include graphs in the final report as they were completely overloaded.

External students

lm_housing_dnb <- lm(data = housing_dnb, success_rate_pct ~ external_students_rate)
tab_model(lm_housing_dnb)
  success_rate_pct
Predictors Estimates CI p
(Intercept) 91.92 91.72 – 92.12 <0.001
external students rate -0.10 -0.10 – -0.09 <0.001
Observations 15803
R2 / R2 adjusted 0.070 / 0.070
Half-boarder students

lm_housing_dnb <- lm(data = housing_dnb, success_rate_pct ~ half_boarders_students_rate)
tab_model(lm_housing_dnb)
  success_rate_pct
Predictors Estimates CI p
(Intercept) 82.04 81.64 – 82.43 <0.001
half boarders students
rate
0.10 0.10 – 0.11 <0.001
Observations 15803
R2 / R2 adjusted 0.084 / 0.084
Boarding students

lm_housing_dnb <- lm(data = housing_dnb, success_rate_pct ~ boarding_students_rate)
tab_model(lm_housing_dnb)
  success_rate_pct
Predictors Estimates CI p
(Intercept) 89.43 89.28 – 89.57 <0.001
boarding students rate -0.06 -0.07 – -0.05 <0.001
Observations 15803
R2 / R2 adjusted 0.005 / 0.005

The p-value is excellent for each regression but as observed with student housing, the R squared is very low. However, we see some intersting variations in the intercept, a difference of 10%, that could prove to be insightful. To dig deeper into these variations, we performed a cluster analysis.

5.2.2.2 Cluster analysis

We performed the cluster analysis on the session 2021 and 2020 and results are very similar. We decided to leave the two analysis but going through one is sufficient to see the method used and understand the results.

2021

To perform the cluster analysis, we filtered the session 2021 and selected the variables department_fr, session, school_id, external_students_rate, half_boarders_students_rate, boarding_students_rate from the joined data set. We then removed th first three and scaled the rest of the data.

The goal of this cluster analysis is to define clusters relative to the offering of each establishment. Through this analysis, we aim to split the establishments by offering. As we have 3 different offering it would not make sense to do too many clusters. To define the right number of cluster for our kmeans clustering, we used the elbow method and concluded that we need 4 clusters. The ratio between the between sum of square and the within sum of square is good at 81.5%.


fviz_nbclust(clust, kmeans, method = "wss") +
  geom_vline(xintercept = 4, linetype = 2) + #add line for better visualisation 
  labs(subtitle = "Elbow method")  #We can determine the optimal number of cluster, 4 clusters seems to be reasonable


# Compute k-means
km.res <- kmeans(clust, 4, nstart = 25)

# Visualize clusters using factoextra
fviz_cluster(km.res, clust,
             ggtheme = theme_ipsum(), 
             repel = TRUE)+
  scale_fill_viridis(discrete = TRUE)+
  scale_color_viridis(discrete = TRUE)

We can see that the four clusters are well defined with cluster one, two and four being differentiated by the x-axis and cluster 3 spanning wider on the x-axis and on the y-axis. One can see what the dimensions represent in the graph below. Cluster one and two have higher half boarder rate than group three and especially group four. Group four will represent establishment without any catering offerings for the students. Group three is establishments with at least some of the students in boarding schools.


fviz_pca_var(PCA(clust, graph = FALSE))

To be able to continue our analyisis on the cluster we just measured, we need to implement the data in the main data set. We gathered the cluster results in a new data frame then added the column of cluster to the data set used for the cluster and then joined it with inner_join to the main data set to keep the filter we had applied.


clust1 <- tibble(department_fr = names(km.res$cluster),
                            cluster = km.res$cluster)

clust_housing_dnb$cluster <- clust1$cluster

housing_dnb_2021 <- inner_join(x = housing_dnb, y = clust_housing_dnb)

To analyse the data by cluster we needed to summarise it by cluster. You can see the results in the table below with an added variable establishment which is the number of establishment present in each cluster.

clust_h_dnb_2021 <- housing_dnb_2021 %>% 
  group_by(cluster) %>% 
  summarise(establishment = n(),
            external_students_rate = mean(external_students_rate, na.rm = TRUE),
            half_boarders_students_rate = mean(half_boarders_students_rate, na.rm = TRUE), 
            boarding_students_rate = mean(boarding_students_rate, na.rm = TRUE),
            without_pct = mean(without_pct, na.rm = TRUE),
            AB_pct = mean(AB_pct, na.rm = TRUE),
            B_pct = mean(B_pct, na.rm = TRUE),
            TB_pct = mean(TB_pct, na.rm = TRUE),
            success_rate_pct = mean(success_rate_pct, na.rm = TRUE)) %>% 
  round(digits = 2)

datatable(clust_h_dnb_2021, options =list(scrollX = "300px"))

We visually represented the results in a spider chart using the radarchart function.

op <- par(mar=c(0, 0, 0, 0))

radar <- clust_h_dnb_2021 %>%
  select(-c(cluster, establishment))

# To use the fmsb package, I have to add 2 lines to the dataframe: the max and min of each variable to show on the plot!
radar <- rbind(rep(100,50) , rep(0,50) , radar)
# plot the radar chart with the right color from viridis 
 radarchart(radar, axistype=1 , 
    #custom polygon
    pcol= viridis(4, alpha = 1) , pfcol= viridis(4, alpha = 0.2)  , plwd=1 , plty=1,
    #custom the grid
    cglcol="grey", cglty=1, axislabcol="grey", caxislabels=seq(0,100,25), cglwd=0.4, calcex = 0.4,
    #custom labels
    vlcex=0.4)#, title = "Radar Graph") 
 # Add a legend
 legend(x=1.2, y=-0.4, legend = c("Cluster 1", "Cluster 2", "Cluster 3", "Cluster 4"), bty = "n", pch=20 , col=viridis(4) , text.col = "grey", cex=0.4, pt.cex=1)

par(op)

For better clarity, we have also displayed each cluster individually.

# we need to set the margin and create two rows to display the graphs
op <- par(mar=c(0, 1, 1, 0),mfrow=c(2, 2))

##### Cluster 1

#filter cluster 1 
radar <- clust_h_dnb_2021 %>% 
 filter(cluster == 1)%>%
  select(-c(cluster, establishment))
# To use the fmsb package, I have to add 2 lines to the dataframe: the max and min of each variable to show on the plot!
radar <- rbind(rep(100,50) , rep(0,50) , radar)
# plot the radar chart with the right color from viridis 
radarchart(radar, axistype=1 , 
    #custom polygon
    pcol= rgb( red = 0.267, green = 0.00392, blue = 0.329,  alpha = 0.5) , pfcol= rgb( red = 0.267, green = 0.00392, blue = 0.329,  alpha = 0.2)  , plwd=1 , plty=1,
    #custom the grid
    cglcol="grey", cglty=1, axislabcol="grey", caxislabels=seq(0,100,25), cglwd=0.4, calcex = 0.4,
    #custom labels
    vlcex=0.4, title = "Cluster 1", cex.main = 1 )

##### Cluster 2
radar <- clust_h_dnb_2021 %>% 
 filter(cluster == 2) %>% 
 select(-c(cluster, establishment))
# To use the fmsb package, I have to add 2 lines to the dataframe: the max and min of each variable to show on the plot!
radar <- rbind(rep(100,50) , rep(0,50) , radar)
# plot the radar chart with the right color from viridis 
radarchart(radar, axistype=1 , 
    #custom polygon
    pcol= rgb( red = 0.192  , green = 0.40784, blue = 0.557,  alpha = 0.5) , pfcol= rgb( red = 0.192    , green = 0.40784, blue = 0.557,  alpha = 0.5)  , plwd=1 , plty=1,
    #custom the grid
    cglcol="grey", cglty=1, axislabcol="grey", caxislabels=seq(0,100,25), cglwd=0.4, calcex = 0.4,
    #custom labels
    vlcex=0.4, title = "Cluster 2", cex.main = 1 )


##### Cluster 3 
radar <- clust_h_dnb_2021 %>% 
 filter(cluster == 3) %>% 
  select(-c(cluster, establishment))
# To use the fmsb package, I have to add 2 lines to the dataframe: the max and min of each variable to show on the plot!
radar <- rbind(rep(100,50) , rep(0,50) , radar)
# plot the radar chart with the right color from viridis 
radarchart(radar, axistype=1 , 
    #custom polygon
    pcol= rgb( red = 0.208  , green = 0.71765, blue = 0.475,  alpha = 0.5) , pfcol= rgb( red = 0.208    , green = 0.71765, blue = 0.475,  alpha = 0.5)  , plwd=1 , plty=1,
    #custom the grid
    cglcol="grey", cglty=1, axislabcol="grey", caxislabels=seq(0,100,25), cglwd=0.4, calcex = 0.4,
    #custom labels
    vlcex=0.4, title = "Cluster 3", cex.main = 1 )

##### Cluster 4 
radar <- clust_h_dnb_2021 %>% 
 filter(cluster == 4) %>% 
  select(-c(cluster, establishment))
# To use the fmsb package, I have to add 2 lines to the dataframe: the max and min of each variable to show on the plot!
radar <- rbind(rep(100,50) , rep(0,50) , radar)
# plot the radar chart with the right color from viridis 
radarchart(radar, axistype=1 , 
    #custom polygon
    pcol= rgb( red = 0.992  , green = 0.90588, blue = 0.145,  alpha = 0.5), pfcol= rgb( red = 0.992 , green = 0.90588, blue = 0.145,  alpha = 0.5), plwd=1 , plty=1,
    #custom the grid
    cglcol="grey", cglty=1, axislabcol="grey", caxislabels=seq(0,100,25), cglwd=0.4, calcex = 0.4,
    #custom labels
    vlcex=0.4, title = "Cluster 4", cex.main = 1  )

par(op)

The cluster analysis show very interesting results. Our intuition from the regression analysis is confirmed as the success rate varies slightly between the four cluster. The highest achieving establishment are the one offering a canteen for lunch where the children stay at school playing with their friends. The lowest success rate is the group going home for lunch where each children is separated and each reality greatly differs. Indeed, some students will have a comforting environment at home whereas other will have a hard time at home or at someone else’s home.

The difference is more significant for some of the distinctions. The achieving rate for the distinction AB and B are rather similar across clusters. Student from an establishment of the fourth cluster are the one achieving the best results with 29.3% of them receiving the distinction TB and only 20.8% not receiving any distinction. The rates are much worse if your school has a boarding offer as 35% of students do not receive any distinction and only 10% get the TB distinction. This could be due that some of them are establishment for elite athletes which start to focus more and more on their sports.

2020

To perform the cluster analysis, we filtered the session 2021 and selected the variables department_fr, session, school_id, external_students_rate, half_boarders_students_rate, boarding_students_rate from the joined data set. We then removed th first three and scaled the rest of the data.

The goal of this cluster analysis is to define clusters relative to the offering of each establishment. Through this analysis, we aim to split the establishments by offering. As we have 3 different offering it would not make sense to do too many clusters. To define the right number of cluster for our kmeans clustering, we used the elbow method. For 2020, the deicision is not as clear as for 2021 and the hesitation is between three and four clusters. As we have four clusters in 2021, we decided to also take four cluster for the 2020 aalysis. The ratio between the between sum of square and the within sum of square is good at 81.1%.


fviz_nbclust(clust, kmeans, method = "wss") +
  geom_vline(xintercept = 4, linetype = 2) + #add line for better visualisation 
  labs(subtitle = "Elbow method")  #We can determine the optimal number of cluster, 4 clusters seems to be reasonable


# Compute k-means
km.res <- kmeans(clust, 4, nstart = 25)

# Visualize clusters using factoextra
fviz_cluster(km.res, clust,
             ggtheme = theme_ipsum(), 
             repel = TRUE)+
  scale_fill_viridis(discrete = TRUE)+
  scale_color_viridis(discrete = TRUE)

We can see that the four clusters are well defined with cluster one, three and four being differiated by the x-axis and cluster two spanning wider on the x-axis and on the y-axis. One can see what the dimensions represent in the graph below. Cluster three and four have higher half boarder rate than group two and especially group one. Group one will represent establishment without any catering offerings for the students. Group two is establishments with at least some of the students in boarding schools.


fviz_pca_var(PCA(clust, graph = FALSE))

To be able to continue our analyisis on the cluster we just measured, we need to implement the data in the main data set. We gathered the cluster data in a new data frame then added the column of cluster to the data set used for the cluster and then joined it with inner_join to the main data set to keep the filter we had applied.


clust1 <- tibble(department_fr = names(km.res$cluster),
                            cluster = km.res$cluster)

clust_housing_dnb$cluster <- clust1$cluster

housing_dnb_2020 <- inner_join(x = housing_dnb, y = clust_housing_dnb)

To analyse the data by cluster we needed to summarise it by cluster. You can see the results in the table below with an added variable establishment which is the number of establishment present in each cluster.

clust_h_dnb_2020 <- housing_dnb_2020 %>% 
  group_by(cluster) %>% 
  summarise(establishment = n(),
            external_students_rate = mean(external_students_rate, na.rm = TRUE),
            half_boarders_students_rate = mean(half_boarders_students_rate, na.rm = TRUE), 
            boarding_students_rate = mean(boarding_students_rate, na.rm = TRUE),
            without_pct = mean(without_pct, na.rm = TRUE),
            AB_pct = mean(AB_pct, na.rm = TRUE),
            B_pct = mean(B_pct, na.rm = TRUE),
            TB_pct = mean(TB_pct, na.rm = TRUE),
            success_rate_pct = mean(success_rate_pct, na.rm = TRUE)) %>% 
  round(digits = 2)

datatable(clust_h_dnb_2020, options =list(scrollX = "300px"))

We visually represented the results in a spider chart using the radarchart function.

op <- par(mar=c(0, 0, 0, 0))

radar <- clust_h_dnb_2020 %>%
  select(-c(cluster, establishment))

# To use the fmsb package, I have to add 2 lines to the dataframe: the max and min of each variable to show on the plot!
radar <- rbind(rep(100,50) , rep(0,50) , radar)
# plot the radar chart with the right color from viridis 
 radarchart(radar, axistype=1 , 
    #custom polygon
    pcol= viridis(4, alpha = 1) , pfcol= viridis(4, alpha = 0.2)  , plwd=1 , plty=1,
    #custom the grid
    cglcol="grey", cglty=1, axislabcol="grey", caxislabels=seq(0,100,25), cglwd=0.4, calcex = 0.4,
    #custom labels
    vlcex=0.4)#, title = "Radar Graph") 
 # Add a legend
 legend(x=1.2, y=-0.4, legend = c("Cluster 1", "Cluster 2", "Cluster 3", "Cluster 4"), bty = "n", pch=20 , col=viridis(4) , text.col = "grey", cex=0.4, pt.cex=1)

par(op)

For better clarity, we have also displayed each cluster individually.

# we need to set the margin and create two rows to display the graphs
op <- par(mar=c(0, 1, 1, 0),mfrow=c(2, 2))

##### Cluster 1

#filter cluster 1 
radar <- clust_h_dnb_2020 %>% 
 filter(cluster == 1)%>%
  select(-c(cluster, establishment))
# To use the fmsb package, I have to add 2 lines to the dataframe: the max and min of each variable to show on the plot!
radar <- rbind(rep(100,50) , rep(0,50) , radar)
# plot the radar chart with the right color from viridis 
radarchart(radar, axistype=1 , 
    #custom polygon
    pcol= rgb( red = 0.267, green = 0.00392, blue = 0.329,  alpha = 0.5) , pfcol= rgb( red = 0.267, green = 0.00392, blue = 0.329,  alpha = 0.2)  , plwd=1 , plty=1,
    #custom the grid
    cglcol="grey", cglty=1, axislabcol="grey", caxislabels=seq(0,100,25), cglwd=0.4, calcex = 0.4,
    #custom labels
    vlcex=0.4, title = "Cluster 1", cex.main = 1 )

##### Cluster 2
radar <- clust_h_dnb_2020 %>% 
 filter(cluster == 2) %>% 
 select(-c(cluster, establishment))
# To use the fmsb package, I have to add 2 lines to the dataframe: the max and min of each variable to show on the plot!
radar <- rbind(rep(100,50) , rep(0,50) , radar)
# plot the radar chart with the right color from viridis 
radarchart(radar, axistype=1 , 
    #custom polygon
    pcol= rgb( red = 0.192  , green = 0.40784, blue = 0.557,  alpha = 0.5) , pfcol= rgb( red = 0.192    , green = 0.40784, blue = 0.557,  alpha = 0.5)  , plwd=1 , plty=1,
    #custom the grid
    cglcol="grey", cglty=1, axislabcol="grey", caxislabels=seq(0,100,25), cglwd=0.4, calcex = 0.4,
    #custom labels
    vlcex=0.4, title = "Cluster 2", cex.main = 1 )


##### Cluster 3 
radar <- clust_h_dnb_2020 %>% 
 filter(cluster == 3) %>% 
  select(-c(cluster, establishment))
# To use the fmsb package, I have to add 2 lines to the dataframe: the max and min of each variable to show on the plot!
radar <- rbind(rep(100,50) , rep(0,50) , radar)
# plot the radar chart with the right color from viridis 
radarchart(radar, axistype=1 , 
    #custom polygon
    pcol= rgb( red = 0.208  , green = 0.71765, blue = 0.475,  alpha = 0.5) , pfcol= rgb( red = 0.208    , green = 0.71765, blue = 0.475,  alpha = 0.5)  , plwd=1 , plty=1,
    #custom the grid
    cglcol="grey", cglty=1, axislabcol="grey", caxislabels=seq(0,100,25), cglwd=0.4, calcex = 0.4,
    #custom labels
    vlcex=0.4, title = "Cluster 3", cex.main = 1 )

##### Cluster 4 
radar <- clust_h_dnb_2020 %>% 
 filter(cluster == 4) %>% 
  select(-c(cluster, establishment))
# To use the fmsb package, I have to add 2 lines to the dataframe: the max and min of each variable to show on the plot!
radar <- rbind(rep(100,50) , rep(0,50) , radar)
# plot the radar chart with the right color from viridis 
radarchart(radar, axistype=1 , 
    #custom polygon
    pcol= rgb( red = 0.992  , green = 0.90588, blue = 0.145,  alpha = 0.5), pfcol= rgb( red = 0.992 , green = 0.90588, blue = 0.145,  alpha = 0.5), plwd=1 , plty=1,
    #custom the grid
    cglcol="grey", cglty=1, axislabcol="grey", caxislabels=seq(0,100,25), cglwd=0.4, calcex = 0.4,
    #custom labels
    vlcex=0.4, title = "Cluster 4", cex.main = 1  )

par(op)

The cluster analysis show very interesting results. Our intuition from the regression analysis is confirmed as the success rate varies between the four cluster. The highest achieving establishment are the one offering a canteen for lunch where the children stay at school playing with their friends. The lowest success rate is the group going home for lunch where each children is separated and each reality greatly differs. Indeed, some students will have a comforting environment at home whereas other will have a hard time at home or at someone else’s home. The difference is also quite marked for some of the distinctions. The achieving rate for the distinction AB and B are rather similar across clusters. Student from an establishment of the first cluster are the one achieving the best results with 33.3% of them receiving the distinction TB and only 17.9% not receiving any distinction. The rates are much worse if your school has a boarding offer as 26.4% of students do not receive any distinction and only 14.7% get the TB distinction. This could be due that some of them are establishment for elite athletes which start to focus more and more on their sports.

From this analysis, we can conclude that having lunch and sleeping at school or at home does have a small influence on the results of the Dimplome National du Brevet. Better results are achieved students in establishment offering the lunch. Going home for lunch does seem to hinder academic success. Boarding schools might not the best choice for students aiming for top honours.

5.2.3 Single-parent families.

The first analysis is at the national level.We want to use two plot types: a bar plot that represent total single parent families and a line plot to represent the success rate. Both plots use session for x-axis. Since the range of the success_rate plot is much lower than the single parent families plot, the use of a secondary y-axis with an adapted scale is recommended.

sp <- single_parent%>%
  select("session", "department_fr", "sing_par")
singpar_vs_dnb <- left_join(dnb_results_dep, sp, by = c("department_fr","session"))

singpar_vs_dnb%>%
  select(success_rate_pct, session, sing_par, department_fr)%>%
  group_by(session)%>%
  summarise(success_rate_pct = mean(success_rate_pct), 
            sing_par = sum (sing_par, na.rm = TRUE)) %>%
  ggplot()+
  geom_col(aes(x = session, y = sing_par)) +
  geom_line(aes(x = session, y = 30000*success_rate_pct), size = 1, color="blue", group = 1) +
  scale_y_continuous(sec.axis = sec_axis(~./30000, name = "Success rate")) +
  labs( x = "Session", y = "Single parent families" )+
  theme_ipsum()

Despite the constant increase in the number of parental families over time, DNB’s results do not appear to be influenced too much by these. Indeed, logically, the more the number of single parent families increase, the more the success rate should decrease. However, from 2007 to 2017, sing_par and success_rate both increased, and from 2017 to 2019, although sing_par continued to increase, the success_rate suffered a drop from 88.74% to 86.14%.

We now want to do an analysis of single parent families against the results of DNB at the departmental level. We will therefore do a linear regression and use singpar_vs_dnb, a dataframe showing the results of DNB and sing_par par department_fr.

ggplot(data = singpar_vs_dnb,
       mapping = aes(x = sing_par, y = success_rate_pct)) +
  labs(title = "Single-parent families VS success rate", x = "sing_par", y = "Success rate", ) +
  geom_point() +
  geom_smooth(method = lm,
              color = "blue",
              size = 0.3)+
  theme_ipsum()
#> Warning: Removed 960 rows containing non-finite values
#> (`stat_smooth()`).
#> Warning: Removed 960 rows containing missing values (`geom_point()`).
#> Warning in grid.Call(C_textBounds, as.graphicsAnnot(x$label), x$x,
#> x$y, : font family not found in Windows font database

#> Warning in grid.Call(C_textBounds, as.graphicsAnnot(x$label), x$x,
#> x$y, : font family not found in Windows font database

#> Warning in grid.Call(C_textBounds, as.graphicsAnnot(x$label), x$x,
#> x$y, : font family not found in Windows font database

#> Warning in grid.Call(C_textBounds, as.graphicsAnnot(x$label), x$x,
#> x$y, : font family not found in Windows font database

#> Warning in grid.Call(C_textBounds, as.graphicsAnnot(x$label), x$x,
#> x$y, : font family not found in Windows font database

#> Warning in grid.Call(C_textBounds, as.graphicsAnnot(x$label), x$x,
#> x$y, : font family not found in Windows font database

#> Warning in grid.Call(C_textBounds, as.graphicsAnnot(x$label), x$x,
#> x$y, : font family not found in Windows font database

#> Warning in grid.Call(C_textBounds, as.graphicsAnnot(x$label), x$x,
#> x$y, : font family not found in Windows font database

#> Warning in grid.Call(C_textBounds, as.graphicsAnnot(x$label), x$x,
#> x$y, : font family not found in Windows font database

#> Warning in grid.Call(C_textBounds, as.graphicsAnnot(x$label), x$x,
#> x$y, : font family not found in Windows font database
#> Warning in grid.Call.graphics(C_text, as.graphicsAnnot(x$label), x$x,
#> x$y, : font family not found in Windows font database
#> Warning in grid.Call(C_textBounds, as.graphicsAnnot(x$label), x$x,
#> x$y, : font family not found in Windows font database

regression_sp <- lm(data = singpar_vs_dnb, success_rate_pct ~ sing_par)
tab_model(regression_sp)
  success_rate_pct
Predictors Estimates CI p
(Intercept) 85.68 85.24 – 86.12 <0.001
sing par -0.00 -0.00 – -0.00 0.020
Observations 576
R2 / R2 adjusted 0.009 / 0.008

The 34 missing values are the overseas departments, which we did not have the information for success_rate. Rsquared is equal to 0.009, and p_value is equal to 0.0203 but according to the significance code, it is not significant. Since variance is not explained, the model does not fit our data. Therefore, success rate of a department does not depend on the number of single parent families in the department.

5.3 Has the COVID-19 pandemic impacted student performance?

In order to observe the influence of Covid on student performance, we are going to investigate the relationship between incidence rate and success rate of DNB in every department in France. To do so, we first had to do a left join between covid_in_school and dnb_results in order to create a new data frame called covid_vs_dnb, and then create a simple linear regression model.

c <- c("2020", "2021")
d<- dnb_results%>%
  select("session", "department_fr", "success_rate_pct")%>%
  filter(session %in% c)%>%
  group_by(department_fr, session)%>%
  summarise(success_rate_pct = mean(success_rate_pct))
co<- covid_in_schools%>%
  select("session", "department_fr", "incidence_rate")%>%
  group_by(department_fr, session)%>%
  summarise(incidence_rate = mean(incidence_rate))
covid_vs_dnb <- inner_join(d, co, by = c("session", "department_fr"))
ggplot(data = covid_vs_dnb,
       mapping = aes(x = incidence_rate, y = success_rate_pct)) +
  labs(title = "incidence-rate of Covid cases VS success rate", x = "Incidence-rate", y = "Success rate", ) +
  geom_point() +
  geom_smooth(method = lm,
              color = "blue",
              size = 0.3)

regression <- lm(data = covid_vs_dnb, success_rate_pct ~ incidence_rate)
tab_model(regression)
  success_rate_pct
Predictors Estimates CI p
(Intercept) 90.12 89.71 – 90.53 <0.001
incidence rate -0.01 -0.02 – -0.01 <0.001
Observations 192
R2 / R2 adjusted 0.220 / 0.216

According to the residual results, the regression is more precise for the high values than the low values. The p-value is close to 0, therefore significant. The model seems usable. However, R squared is around 0.2, meaning that it explains only 20% of the variation of the success rate. Therefore, the result here are not very conclusive.

We then tried to do a linear regression for 2020 and 2021 separately because we thought that the 2020 low values could have distorted our model.

2020
covid_vs_dnb2020 <- inner_join(d, co, by = c("session", "department_fr"))%>%
  filter(session ==2020)
ggplot(data = covid_vs_dnb2020,
       mapping = aes(x = incidence_rate, y = success_rate_pct)) +
  labs(title = "incidence-rate of Covid cases VS success rate", x = "Incidence-rate", y = "Success rate", ) +
  geom_point() +
  geom_smooth(method = lm,
              color = "blue",
              size = 0.3)
regression2020 <- lm(data = covid_vs_dnb2020, success_rate_pct ~ incidence_rate)
tab_model(regression2020)
  success_rate_pct
Predictors Estimates CI p
(Intercept) 90.12 89.60 – 90.64 <0.001
incidence rate 0.01 -0.17 – 0.20 0.880
Observations 96
R2 / R2 adjusted 0.000 / -0.010

2021
covid_vs_dnb2021 <- inner_join(d, co, by = c("session", "department_fr"))%>%
  filter(session ==2021)
ggplot(data = covid_vs_dnb2021,
       mapping = aes(x = incidence_rate, y = success_rate_pct)) +
  labs(title = "incidence-rate of Covid cases VS success rate", x = "Incidence-rate", y = "Success rate", ) +
  geom_point() +
  geom_smooth(method = lm,
              color = "blue",
              size = 0.3)
regression2021 <- lm(data = covid_vs_dnb2021, success_rate_pct ~ incidence_rate)
tab_model(regression2021)
  success_rate_pct
Predictors Estimates CI p
(Intercept) 89.51 88.00 – 91.02 <0.001
incidence rate -0.01 -0.02 – -0.00 0.046
Observations 96
R2 / R2 adjusted 0.042 / 0.031

From these regression results, we see that we cannot use these models to explain the influence of the incidence rate and the success rate. Consequently, covid seems to not have a big impact on the dnb results. This finding is not so surprising because of the implementation of continuous assessment. In fact, according to the article by l’Etudiant (Cojean, 2020), the French government has taken special measures for DNB exams following Covid: since 2020, graduates were awarded on the basis of continuous assessment, meaning no final exam.

Since our linear regressions were inconclusive, the next step would be to observe whether or not covid influenced the attribution of the distinction. It can be hypothesized that given the continuous assessment, the rating has been more lenient. This would therefore lead to better results, thus more students admitted with mentions. In view of the results for 2020, which did not vary greatly, we will mainly focus on the attribution of distinction in 2021.To do so, we created covid_vs_dnbmention.

m<- dnb_results%>%
  select("session", "department_fr", "TB_pct", "B_pct", "AB_pct", "admitted_without")%>%
  filter(session == 2021)%>%
  group_by(department_fr, session)%>%
  summarise(TB_pct=mean(TB_pct), B_pct=mean(B_pct), AB_pct=mean(AB_pct), admitted_without=mean(admitted_without))
co<- covid_in_schools%>%
  select("session", "department_fr", "incidence_rate")%>%
  group_by(department_fr, session)%>%
  summarise(incidence_rate = mean(incidence_rate))
covid_vs_dnbmention <- inner_join(m, co, by = c("session", "department_fr"))
Linear regression for each distinction
TB
ggplot(data = covid_vs_dnbmention,
         mapping = aes(x = incidence_rate, y = TB_pct)) +
  labs(title = 'Incidence-rate of Covid cases VS Distinction "Très Bien"', x = "Incidence-rate", y = "Percentage of TB", ) +
  geom_point() +
  geom_smooth(method = lm,
              color = "blue",
              size = 0.3)
#> Warning: Removed 2 rows containing non-finite values
#> (`stat_smooth()`).
#> Warning: Removed 2 rows containing missing values (`geom_point()`).
regression <- lm(data = covid_vs_dnbmention, TB_pct ~ incidence_rate)
tab_model(regression)
  TB_pct
Predictors Estimates CI p
(Intercept) 20.21 17.70 – 22.73 <0.001
incidence rate 0.02 0.00 – 0.04 0.016
Observations 94
R2 / R2 adjusted 0.062 / 0.052

B
ggplot(data = covid_vs_dnbmention,
         mapping = aes(x = incidence_rate, y = B_pct)) +
  labs(title = 'Incidence-rate of Covid cases VS Distinction "Bien"', x = "Incidence-rate", y = "Percentage of B", ) +
  geom_point() +
  geom_smooth(method = lm,
              color = "blue",
              size = 0.3)
#> Warning: Removed 2 rows containing non-finite values
#> (`stat_smooth()`).
#> Warning: Removed 2 rows containing missing values (`geom_point()`).
regression <- lm(data = covid_vs_dnbmention, B_pct ~ incidence_rate)
tab_model(regression)
  B_pct
Predictors Estimates CI p
(Intercept) 25.68 24.36 – 27.00 <0.001
incidence rate -0.00 -0.01 – 0.00 0.281
Observations 94
R2 / R2 adjusted 0.013 / 0.002

AB
ggplot(data = covid_vs_dnbmention,
         mapping = aes(x = incidence_rate, y = AB_pct)) +
  labs(title = 'incidence-rate of Covid cases VS Distinction "Assez Bien"', x = "Incidence-rate", y = "Percentage of AB", ) +
  geom_point() +
  geom_smooth(method = lm,
              color = "blue",
              size = 0.3)
#> Warning: Removed 2 rows containing non-finite values
#> (`stat_smooth()`).
#> Warning: Removed 2 rows containing missing values (`geom_point()`).
regression <- lm(data = covid_vs_dnbmention, AB_pct ~ incidence_rate)
tab_model(regression)
  AB_pct
Predictors Estimates CI p
(Intercept) 28.59 27.14 – 30.05 <0.001
incidence rate -0.02 -0.03 – -0.01 0.002
Observations 94
R2 / R2 adjusted 0.099 / 0.089

Without Distinction
ggplot(data = covid_vs_dnbmention,
       mapping = aes(x = incidence_rate, y = admitted_without)) +
  labs(title = "incidence-rate of Covid cases VS Admitted Without Distinction", x = "Incidence-rate", y = "Percentage of students admitted without distinction", ) +
  geom_point() +
  geom_smooth(method = lm,
              color = "blue",
              size = 0.3)
regression <- lm(data = covid_vs_dnbmention, admitted_without ~ incidence_rate )
tab_model(regression)
  admitted_without
Predictors Estimates CI p
(Intercept) 9.87 6.89 – 12.85 <0.001
incidence rate 0.05 0.03 – 0.07 <0.001
Observations 96
R2 / R2 adjusted 0.204 / 0.196

In view of the various linear regressions, as covid cases increased, the academic government seemed to have granted more easily the highest distinction and without distinction. In fact, the linear regression of distinction TB and admitted_without indicate moderate positive correlations. This therefore leads to think that the grading was more lenient for the departments where covid was very present. Nevertheless, it is difficult to interpret these results as they are statistically not significant and all Rsquared are extremely poor.

Since the linear regression does not allow us to say much about the influence of Covid on student success, we decided to do an analysis from clusters and see if we can get anything interesting results out of it. As for the cluster analysis performed for student_housing we first selected the variables, success_rate_pctand incidence_rate in this case. Then we found the right number of clusters using the elbow method and then performed a kmean cluster analysis.

cluster1<- covid_vs_dnb2021
cluster <- cluster1[-c(1,2)]
row.names(cluster) <- as.vector(t(cluster1[,1]))
cluster <- scale(cluster)


fviz_nbclust(cluster, kmeans, method = "wss") +
  geom_vline(xintercept = 5, linetype = 2) +
  labs(subtitle = "Elbow method")
km.res1 <- kmeans(cluster, 5, nstart = 25) #77.4%
fviz_cluster(km.res1, cluster,
             ggtheme = theme_ipsum(),
             repel = TRUE)+
  scale_fill_viridis(discrete = TRUE)+
  scale_color_viridis(discrete = TRUE)

We end up with five cluster with cluster 1 having the highest success rate and some of the smallest incidence rate and cluster 3 having the highest incidence rate some of the worst success rate. We will focus on these two clusters as they are the main outliers. To do the analysis, we included the clusters in the data set dnb_results_dep and summarized the data by cluster.

clust1 <- tibble(department_fr = names(km.res1$cluster),
                            cluster = km.res1$cluster)

clust_cov_dnb <- inner_join(x = dnb_results_dep, y = clust1)


clust_cov_dnb <- clust_cov_dnb %>% 
  filter(cluster %in% c(1,3)) %>% 
  group_by(cluster, session) %>% 
  summarise(department = n(),
            without_pct = mean(without_pct, na.rm = TRUE),
            AB_pct = mean(AB_pct, na.rm = TRUE),
            B_pct = mean(B_pct, na.rm = TRUE),
            TB_pct = mean(TB_pct, na.rm = TRUE),
            success_rate_pct = mean(success_rate_pct, na.rm = TRUE)) %>% 
  round(digits = 2)
  
datatable(clust_cov_dnb, options =list(scrollX = "300px"))

The results of the cluster shows that cluster 1 is performing better with a lower incidence rate. Before concluding on the fact that a low incidence rate lead to better results, we needed to see if the difference in success rate was truly induced by the Covid 19 or if it was that these departments were generally performing better. For this comparison, we plotted all their results in two separate graphs but first we had to pivot the data for all variables into one column.

clust_cov_dnb <- clust_cov_dnb %>%   
  pivot_longer(c(contains("pct")), 
               names_to = "Mention_type",
               values_to = "Rate")


p <- clust_cov_dnb %>% 
  filter(cluster == 1) %>% 
  ggplot(aes(x = session, y = Rate, group = Mention_type, color = Mention_type))+
  geom_line()+
  scale_color_viridis(discrete = TRUE) +
    labs(title = "Cluster 1", x = "Session", y = "Rate in %" ) +
    theme_ipsum()
ggplotly(p, tooltip = c("x" ,"y"))
p <- clust_cov_dnb %>% 
  filter(cluster == 3) %>% 
  ggplot(aes(x = session, y = Rate, group = Mention_type, color = Mention_type))+
  geom_line()+
  scale_color_viridis(discrete = TRUE) +
    labs(title = "Cluster 3", x = "Session", y = "Rate in %" ) +
    theme_ipsum()

ggplotly(p, tooltip = c("x" ,"y"))

We can see that the difference has not been induced by the Covid 19 as the success rate of cluster 1 has always been higher than the one of cluster 3. Moreover, we see the same pattern for the success rate as they both increased by around 5% in 2020 and slightly went down in 2021. Regarding the distinctions, one can see that it is fairly similar in terms of percentage and pattern apart from the students graduating without honours who used to be a bigger share of the admitted students in the departments belonging to cluster 3.

To conclude on the question regarding the influence of Covid 19 on the results of the Diplome national du Brevet, we have seen through linear regressions and a cluster analysis that Covid19 had clost to no impact the success rate and distinctions. We can assume that as the Covid and the exam were on a national level the government body adapted the scale to the results of the students.

6 Conclusion

6.1 Take home message

This project allowed us to explore and analyse the results of the Diplome National du Brevet on the 2006-2021 period and discover that political decisions have a far grater impact than any other socioeconomic factor. Indeed, in 2017, we clearly saw the impact of the reform “learn better to succeed better” brought by the minister of education. The distribution of the distinctions went from 45% without, 30% AB, 20% B and 5% TB to around 25% for each. The impact of political actions is clear for the distinctions but not so much for the success rate. It increased steadily over the time period. the only evidence of the impact of a policy action on success rate can be seen in 2020 where due to the Covid 19 politics decided to cancel the final exam and grade using continuous assessment.

To come to the conclusion that political action had a bigger impact on the results of the Diplome National du Brevet than socioeconomic factors, we analysed the impact of three of them. The type of housing and lunch place was one of them. The data on student housing allowed us to discover that this factor had an impact on the distinctions the student were more likely to receive. Indeed, a student with the possibility of going to the canteen is more likely to pass with the highest hounours and less likely to have none than the average student. Conversely, students living in boarding schools will typically underperform. Surprisingly, the type of housing and lunch place has no impact on the success rate. We also explored the impact of sport through establishment labelled Generation 2024. We first saw that despite being created around the Paris Olympic, establishments form all over the country took part in this initiative. We could not see any link between a variation of the results and the labeling or not of an establishment. This probably comes from the fact that the label is built around promotion of sport in sport associations and not doing sport in school meaning that it has a smaller impact on the children. The last socio-economical factor that we analysed was the family situation through a data set about single parent families. An astonishing results was that the rate of single parent families per student admitted is stable across all french departments and remains nearly unchanged over our analysis period. The result of the analysis of a potential impact on the DNB results are that single families do not seem to impact students’ performance.

We also analysed the potential impact of the COVID 19 pandemic on the DNB results. In our exploration of the data we found out that the Covid19 crisis impacted more heavily the eastern part of France in 2021. However, the lock-down guidelines were nationalized meaning that all students were playing with the same rules. Therefore, we found no significant impact of the number of cases per department on the results of the Diplome National du Brevet. The only impact of Covid 19, as previously discussed, can be seen on a national level.

6.2 Limitations

Our analysis has some clear limitations. For our analysis on single parent family we were only able to work on a departmental level and with data on the global population and not only on students passing the Diplome National du Brevet. This is problematic because it lead to some very large approximations at every level of the analysis. For the part on the impact of sport, we chose the data on the label Generation 2024. Labelled establishments have a certain duty to promote the Paris Olympics and sports in general but the objective of the label are quite large. It means that not all labelled establishments have been actively promoting the act of going out and practicing some sport on a regular basis. Having clearer objectives or using another label promoting sport activity could have lead to more conclusive results. In our analysis of the impact of Covid, we decided to take into account all cases during the year but not all patients reacted the same way leading to different impact on their performance. Moreover, the students were evaluated on the basis of continuous assesment, it would have been more conclusive to analyse the results of each assessment during the years than the final results.

6.3 Future work

For further researches, it would be interesting to analyse the impact of Covid and single parent families by establishment to have more definitive results. In the quest of finding socioeconomic factors influencing children one would have many more to analyse such as the wealth of the parents, the size of the house they live in, the number of sports clubs where they are registered, the number of hours they spend on screens, the number of kilometers traveled to go to schools. The list is clearly non-exhaustive and one could try to analyse any factor that could be related to the results of the DNB. Another very interesting research that should be done is an analysis comprising several parameters to identify the ones having the bigger impact. Such an analysis would allow the creation of a model to predict the results of the different estasblishments.

6.4 References